CodeForManaus / vacina-manaus-backend

#VacinaManaus (backend) | COVID-19 vaccination data in Manaus
GNU General Public License v3.0

High memory usage / memory leak #70

Open pvfrota opened 3 years ago

pvfrota commented 3 years ago

Description

A problem in the PDFMiner.six library (a dependency of PDFPlumber) is causing high memory usage. There appears to be a memory leak that keeps memory from being released even after we have finished reading a page.

The workarounds proposed for the problem (which involve reopening the file) proved too slow and ineffective in our case, since we always process large files (3000+ pages).
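For reference, the kind of workaround discussed in the issues linked below processes the PDF in batches and reopens the file between batches; a minimal sketch of that pattern (the path, batch size, and per-page helper are illustrative, not our actual code):

import pdfplumber

PATH = "data/raw/large_report.pdf"  # illustrative path
BATCH_SIZE = 100                    # pages per reopen; illustrative

with pdfplumber.open(PATH) as pdf:
    total_pages = len(pdf.pages)

for start in range(0, total_pages, BATCH_SIZE):
    # Reopen the file for each batch so memory held by earlier batches can be released
    with pdfplumber.open(PATH) as pdf:
        for page in pdf.pages[start:start + BATCH_SIZE]:
            process_page(page)  # hypothetical per-page processing

For a 3000+ page file this means reparsing the document dozens of times, which is why it was too slow here.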

More details can be found at:

https://github.com/pdfminer/pdfminer.six/issues/580
https://github.com/jsvine/pdfplumber/issues/193
https://github.com/jsvine/pdfplumber/issues/263

pvfrota commented 3 years ago

What has already been tried:

What has not been tried:

jsvine commented 3 years ago

Hello, and I'm glad you opened this issue. I am the creator of pdfplumber, and I am happy to see the library being used here. The memory issues are unfortunate. I don't have control over the ones in pdfminer.six, but I think I can help with the memory issues in pdfplumber. Based on some testing, I think there's a more straightforward solution, one that does not require you to open and close the PDF multiple times:

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()  # your per-page processing
        # Drop the page's cached object and layout data so memory can be reclaimed
        del page._objects
        del page._layout

If this approach works, I'll aim to get a more convenient page-closing method into the next release.

jsvine commented 3 years ago

Update: Hah, I forgot that pdfplumber already has an (undocumented) way of doing this :)

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()  # your per-page processing
        page.flush_cache()  # clears the page's cached properties

pvfrota commented 3 years ago

@jsvine thank you so much for the insights, and for taking the time to read the issue even though it is in Portuguese. I will try this solution and report back on whether it worked.

pvfrota commented 3 years ago

@jsvine, after implementing the solution, memory consumption stabilized at a low level while extracting data from the PDF. I just passed a different argument to the flush_cache function, adding the _objects property (I don't know if it's really necessary).

https://github.com/CodeForManaus/vacina-manaus-backend/blob/4dbcdef3086a9af1a1b648094072b117cb198ae0/extract_data.py#L216-L268
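For reference, a minimal sketch of what that call can look like, assuming flush_cache accepts an optional list of cached property names (the file path and the per-page helper are illustrative; the linked extract_data.py is the actual code):

import pdfplumber

with pdfplumber.open("data/raw/report.pdf") as pdf:
    for page in pdf.pages:
        extract_rows(page)  # hypothetical per-page extraction step
        # Explicitly list _objects among the cached properties to clear
        page.flush_cache(["_objects", "_layout"])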

However, there is an earlier step in our processing where we go through the entire PDF to detect the position of the columns in the file, and it still suffers from excessive memory consumption, even with the proposed solution.

https://github.com/CodeForManaus/vacina-manaus-backend/blob/4dbcdef3086a9af1a1b648094072b117cb198ae0/extract_data.py#L106-L167

If you are interested in looking at the code and proposing a more optimized way to detect the position of the columns in the file, I would be very grateful.

This is a PDF file processed by this code: https://github.com/CodeForManaus/vacina-manaus-backend/blob/master/data/raw/027_Vacinados_2021_02_11_20_00_00_TCE.pdf
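To give a rough idea of what that pass involves, a hypothetical simplification is shown below (the linked extract_data.py is the authoritative version; the function name and the rounding heuristic are illustrative only):

import pdfplumber

def detect_column_positions(path):
    # Hypothetical sketch: walk every page, collect word x-coordinates,
    # and flush each page's cache just like in the extraction step above.
    x_positions = set()
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for word in page.extract_words():
                x_positions.add(round(word["x0"]))
            page.flush_cache()
    return sorted(x_positions)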

jsvine commented 3 years ago

Hi @pvfrota, and thanks for the update!

I just passed a different argument to the flush_cache function, adding the _objects property (I don't know if it's really necessary)

It should not be necessary, since _objects is already included here: https://github.com/jsvine/pdfplumber/blob/002727803c09fdd739ae4bac9090c3801c097cac/pdfplumber/container.py#L6

... but did you still experience the initial memory problems without that modification?

However, there is an earlier step in our processing where we go through the entire PDF to detect the position of the columns in the file, and it still suffers from excessive memory consumption, even with the proposed solution.

Hmmm, interesting. I can try to help but, unfortunately, I'm not very familiar with your codebase. Would you be able to share a standalone script that demonstrates this other memory issue? Or tell me what commands I should run to reproduce the issue?
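Even something as small as a loop that prints peak memory as it walks the pages would help, for example (a hypothetical sketch; adjust the path and the per-page work to whatever triggers the problem):

import resource

import pdfplumber

with pdfplumber.open("data/raw/027_Vacinados_2021_02_11_20_00_00_TCE.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        page.extract_words()  # stand-in for the column-detection work
        page.flush_cache()
        if i % 100 == 0:
            # Peak resident set size so far (kilobytes on Linux)
            print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)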

pvfrota commented 3 years ago

Reopening this issue since it is not completely resolved.