bitextor / pdf-extract

PDF parser and converter to HTML
GNU General Public License v3.0

Over 2M trash files produced while crawling #56

Closed: mbanon closed this issue 3 years ago

mbanon commented 3 years ago

Hi, I've been running Bitextor on the snake_performance branch (mentioning @lpla in case he is needed here).

After it finished, I noticed that 2,162,294 (!!!) files, with names of the form nonsense-{numbers}.png, had appeared in my crawling directory:

[Screenshot, 2020-12-18 13:27:43]

After some investigation, I found this line, which looks suspicious: https://github.com/bitextor/pdf-extract/blob/d4fe244408c55c1b881e62ccee75780e74930dda/src/pdfextract/PDFToHtml.java#L194

This needs to be fixed ASAP...

mbanon commented 3 years ago

I ran another instance of Bitextor, and it created almost 9M files... [Screenshot, 2020-12-20 11:51:34]

mbanon commented 3 years ago

And here is the suspicious command, caught in htop: [Screenshot: htop]

(These processes appear here and there, each one running for a few seconds.)

ramoelee commented 3 years ago

Hi @mbanon, please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

mbanon commented 3 years ago

Ok, I applied the fix and it's running now.

Will close when Bitextor finishes (with no trash files :) ) Thanks!
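
For anyone who still has leftover files from an affected run, here is a minimal cleanup sketch. It is not part of pdf-extract or Bitextor. It assumes the stray files sit directly in the crawling directory and match the nonsense-{numbers}.png pattern described above; the class name TrashFileCleaner and the --delete flag are made up for this example. Without --delete it only counts the matches, so a dry run first is probably wise.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdf-extract. Requires Java 11+ (Path.of).
public class TrashFileCleaner {
    // Assumed name pattern, based on the report: nonsense-{numbers}.png
    private static final Pattern TRASH = Pattern.compile("nonsense-\\d+\\.png");

    /** Counts, and optionally deletes, matching regular files directly inside dir. */
    public static long sweep(Path dir, boolean delete) throws IOException {
        long matched = 0;
        // DirectoryStream iterates lazily, so millions of entries are not held in memory at once.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && TRASH.matcher(name).matches()) {
                    matched++;
                    if (delete) {
                        Files.delete(entry);
                    }
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) throws IOException {
        // First argument: directory to scan (defaults to the current one).
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        // Second argument: the made-up --delete flag switches from counting to deleting.
        boolean delete = args.length > 1 && args[1].equals("--delete");
        long n = sweep(dir, delete);
        System.out.println(n + (delete ? " files deleted" : " files matched"));
    }
}
```

Running `java TrashFileCleaner.java /path/to/crawl` reports how many files match; adding `--delete` as a second argument removes them. Subdirectories are not visited.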