Over 2M trash files produced while crawling

bitextor / pdf-extract

PDF parser and converter to HTML

GNU General Public License v3.0

Closed (mbanon closed this issue 3 years ago)

mbanon commented 3 years ago

Hi, I've been running Bitextor on the snake_performance branch (mentioning @lpla in case he is needed here).

After it finished, I noticed that 2,162,294 (!!!) files, with names of the form nonsense-{numbers}.png, had appeared in my crawling directory:

[Screenshot: Captura de pantalla_2020-12-18_13-27-43]

This needs to be fixed asap...
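For anyone who wants to check their own crawling directory, here is a rough sketch of how the stray files could be counted. It is only an illustration: the filename pattern (some prefix, a hyphen, digits, then .png) is an assumption based on the description above, and the helper names find_trash_files and count_trash_files are made up for this example. It only looks at files directly inside the given directory and does not delete anything.

```python
import re
from pathlib import Path

# Assumed pattern (not confirmed by the tool): <anything>-<digits>.png
TRASH_NAME = re.compile(r"^.+-\d+\.png$")


def find_trash_files(directory):
    """Yield files directly inside `directory` whose names match the assumed pattern."""
    for entry in Path(directory).iterdir():
        if entry.is_file() and TRASH_NAME.match(entry.name):
            yield entry


def count_trash_files(directory):
    """Count matching files without building a list (there may be millions)."""
    return sum(1 for _ in find_trash_files(directory))


if __name__ == "__main__":
    # "." is a placeholder; point this at the actual crawling directory.
    print(count_trash_files("."))
```

The generator keeps memory use flat even with millions of entries. Review the matches before removing anything, since the pattern could also catch legitimate images.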

mbanon commented 3 years ago

I ran another instance of Bitextor, and it created almost 9M files...

[Screenshot: Captura de pantalla_2020-12-20_11-51-34]

mbanon commented 3 years ago

And here is the suspicious command, caught in htop:

[Screenshot: htop]

(They appear here and there, each one running for a few seconds)

ramoelee commented 3 years ago

Hi @mbanon, please update the source code and reinstall PDFExtract.jar to resolve the issue.

Thanks

mbanon commented 3 years ago

Ok, I applied the fix and it's running now.

Will close when Bitextor finishes (with no trash files :) ) Thanks!