deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.87k stars 592 forks source link

hangs on large PDFs #33

Closed thebenedict closed 10 years ago

thebenedict commented 10 years ago

I'm seeing the Python package hang on large PDFs. To reproduce, use a large pdf like this World Bank Annual Report (pdf) and run textract.process('filename.pdf'). I get no output but the command doesn't complete.

Popen.wait() in shell.py is filling up the OS buffer. Switching to Popen.communicate() as suggested in the link above fixes the issue, but since communicate() returns the output (and error) text directly, I had to modify shell and the pdf parser to return text instead of a pipe object.

Happy to submit a pull request, but is it wise to change the return value of shell (and therefore need to change every parser)? Is there something else to return that we can call sdtout.read() on?

FWIW, this post suggests using tempfile.TemporaryFile() to work around the buffer issue, but I couldn't figure out how to call pipe.stdout.read() with the tempfile.

christomitov commented 10 years ago

@thebenedict Found this on the issue: http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output

Going to play with the solutions proposed by Glenn Maynard there and see what happens.

deanmalmgren commented 10 years ago

Well that's a bummer. Are you sure the problem is that python is hanging or is it just that extracting the text from PDFs takes a long time?

The reason I ask is that --- depending on how the PDF parser extracts text --- this can take a long time. e.g., https://github.com/euske/pdfminer/issues/61

If you just run pdftotext filename.pdf, is that fast? On Aug 6, 2014 10:57 PM, "Christo Mitov" notifications@github.com wrote:

@thebenedict https://github.com/thebenedict Found this on the issue: http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output

Going to play with the two solutions proposed by Glenn Maynard there and see what happens.

Reply to this email directly or view it on GitHub https://github.com/deanmalmgren/textract/issues/33#issuecomment-51425248 .

deanmalmgren commented 10 years ago

Also, for what its worth, I'm not opposed to having the shell.run function return a string instead of a pipe object. That choice wasn't...er hem...carefully made

(sorry I can't be more help; in the airport at the moment)

thebenedict commented 10 years ago

Fairly sure it's hanging. CPU is at 0 when it stops, and both pdftotext filename.pdf and pdf2text filename.pdf complete in a second or two in ipython shell.

I'll try to take another look at it later today. Assuming it's a buffer issue, a tempfile seems like a good approach.

deanmalmgren commented 10 years ago

{{kicks floor. cloud of dust drifts in the wind}}

Bummer.

christomitov commented 10 years ago

@deanmalmgren @thebenedict Alternatively we can just remove the .wait() call as waiting for the process to finish just to read the pipe anyway is not needed. I can open a PR if you like, I basically just removed the .wait(), and the returncode check and it doesn't hang. Ideally the returncode check will have to happen after the data has been read.

thebenedict commented 10 years ago

@christomitov @deanmalmgren Sounds good to me -- is there a branch where I can try it out? I'd like to see what that does when there actually is an error in the pdf.

christomitov commented 10 years ago

@thebenedict Here you go , to do the returncode check will probably for sure require a .poll() to check if the process has terminated.

EDIT: I think we're just going to have to switch to use .communicate() like you originally suggested.. there doesn't seem to be another way to retain exceptions based on return codes otherwise.

deanmalmgren commented 10 years ago

Thanks for looking into this. Sorry I've been neglecting it; I got busy the past few days and haven't had a chance to properly take a look.

All things being equal, I'd prefer to keep the return code check in the run function to make this as clean as possible. If it makes sense to refactor and have run return a string instead of a pipe object, I am 100% supportive of that change.

I'll try to take a closer look over the weekend and hopefully we can release a patch to fix this issue ASAP

thebenedict commented 10 years ago

Sweet, thanks @deanmalmgren. Good hanging with you both.