documentcloud / docsplit

Break Apart Documents into Images, Text, Pages and PDFs
http://documentcloud.github.io/docsplit/

diskspace leak when extracting text from pdf #151

Open KHMtravel opened 5 years ago

KHMtravel commented 5 years ago

I am trying to extract the text from this PDF: https://gofile.io/?c=6U8qE8. I have a Rack application running inside a Docker container on Ubuntu 18.04.

After calling Docsplit.extract_text('spec/test.pdf', ocr: true, language: 'eng', output: 'spec/output.txt'), I see that the gs process uses the most CPU, and I lose 1 GB of disk space every 5 seconds until there is no space left.

Maybe someone has an idea what is going wrong here?

justinperkins commented 4 years ago

While investigating a long-running Docsplit job on a PDF that contained no text, I ran into this same issue on my local dev machine (a Rails app running on a Vagrant instance with Ubuntu). After the job had run for 10+ minutes, I ran out of disk space. I killed the job and restarted my host machine to get 40 GB back.
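
This does not fix the underlying problem, but one way to keep a runaway job like this from filling the disk is to run the extraction in a child process with a time limit and a throwaway scratch directory. The sketch below is only a rough illustration: run_contained is a hypothetical helper (not part of Docsplit), the 300-second limit is arbitrary, and it assumes the temporary files end up under the directory named by TMPDIR, which nothing in this thread confirms.

```ruby
require 'timeout'
require 'tmpdir'

# Hypothetical helper (not part of Docsplit): run a command in its own
# process group with a private scratch directory and a time limit.
# Returns true if the command exited successfully, false if it failed
# or had to be killed after `timeout` seconds.
def run_contained(*cmd, timeout:)
  # The block form of Dir.mktmpdir removes the directory and its contents
  # when the block exits, whether the command finished or was killed.
  Dir.mktmpdir('docsplit-job') do |scratch|
    # Assumption: the child and anything it launches (such as gs) write
    # their temp files under TMPDIR.
    pid = Process.spawn({ 'TMPDIR' => scratch }, *cmd, pgroup: true)
    begin
      _, status = Timeout.timeout(timeout) { Process.wait2(pid) }
      status.success?
    rescue Timeout::Error
      begin
        # A signal name starting with a minus sign targets the whole
        # process group, so children of the command are killed as well.
        Process.kill('-KILL', pid)
        Process.wait(pid)
      rescue Errno::ESRCH, Errno::ECHILD
        # The child exited on its own just before we tried to kill it.
      end
      false
    end
  end
end

# Example call, using the extraction from the original report:
# script = "Docsplit.extract_text('spec/test.pdf', ocr: true, " \
#          "language: 'eng', output: 'spec/output.txt')"
# ok = run_contained('ruby', '-rdocsplit', '-e', script, timeout: 300)
```

If the temporary files turn out to be written somewhere other than TMPDIR, the timeout still stops the process group, but the leftover files would have to be found and removed separately.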