Closed thebenedict closed 10 years ago
@thebenedict Found this on the issue: http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output
Going to play with the solutions proposed by Glenn Maynard there and see what happens.
Well that's a bummer. Are you sure the problem is that python is hanging or is it just that extracting the text from PDFs takes a long time?
The reason I ask is that --- depending on how the PDF parser extracts text --- this can take a long time. e.g., https://github.com/euske/pdfminer/issues/61
If you just run pdftotext filename.pdf
, is that fast?
On Aug 6, 2014 10:57 PM, "Christo Mitov" notifications@github.com wrote:
@thebenedict https://github.com/thebenedict Found this on the issue: http://stackoverflow.com/questions/1180606/using-subprocess-popen-for-process-with-large-output
Going to play with the two solutions proposed by Glenn Maynard there and see what happens.
Reply to this email directly or view it on GitHub https://github.com/deanmalmgren/textract/issues/33#issuecomment-51425248 .
Also, for what its worth, I'm not opposed to having the shell.run function return a string instead of a pipe object. That choice wasn't...er hem...carefully made
(sorry I can't be more help; in the airport at the moment)
Fairly sure it's hanging. CPU is at 0 when it stops, and both pdftotext filename.pdf
and pdf2text filename.pdf
complete in a second or two in ipython shell.
I'll try to take another look at it later today. Assuming it's a buffer issue, a tempfile seems like a good approach.
{{kicks floor. cloud of dust drifts in the wind}}
Bummer.
@deanmalmgren @thebenedict Alternatively we can just remove the .wait()
call as waiting for the process to finish just to read the pipe anyway is not needed. I can open a PR if you like, I basically just removed the .wait()
, and the returncode
check and it doesn't hang. Ideally the returncode
check will have to happen after the data has been read.
@christomitov @deanmalmgren Sounds good to me -- is there a branch where I can try it out? I'd like to see what that does when there actually is an error in the pdf.
@thebenedict Here you go , to do the returncode
check will probably for sure require a .poll()
to check if the process has terminated.
EDIT: I think we're just going to have to switch to use .communicate()
like you originally suggested.. there doesn't seem to be another way to retain exceptions based on return codes otherwise.
Thanks for looking into this. Sorry I've been neglecting it; I got busy the past few days and haven't had a chance to properly take a look.
All things being equal, I'd prefer to keep the return code check in the run
function to make this as clean as possible. If it makes sense to refactor and have run
return a string instead of a pipe object, I am 100% supportive of that change.
I'll try to take a closer look over the weekend and hopefully we can release a patch to fix this issue ASAP
Sweet, thanks @deanmalmgren. Good hanging with you both.
I'm seeing the Python package hang on large PDFs. To reproduce, use a large pdf like this World Bank Annual Report (pdf) and run
textract.process('filename.pdf')
. I get no output but the command doesn't complete.Popen.wait() in shell.py is filling up the OS buffer. Switching to
Popen.communicate()
as suggested in the link above fixes the issue, but since communicate() returns the output (and error) text directly, I had to modify shell and the pdf parser to return text instead of a pipe object.Happy to submit a pull request, but is it wise to change the return value of shell (and therefore need to change every parser)? Is there something else to return that we can call
sdtout.read()
on?FWIW, this post suggests using tempfile.TemporaryFile() to work around the buffer issue, but I couldn't figure out how to call
pipe.stdout.read()
with the tempfile.