deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

`UnboundLocalError: local variable 'pipe' referenced before assignment` #256

Open SatyaRamGV opened 5 years ago

SatyaRamGV commented 5 years ago

text = textract.process(file, method='pdfminer')

Error:

UnboundLocalError Traceback (most recent call last)
in ()
----> 1 text = textract.process(file, method='pdfminer')

~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
     75
     76     parser = filetype_module.Parser()
---> 77     return parser.process(filename, encoding, **kwargs)
     78
     79

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
     44         # output encoding
     45         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46         byte_string = self.extract(filename, **kwargs)
     47         unicode_string = self.decode(byte_string)
     48         return self.encode(unicode_string, encoding)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
     29
     30         elif method == 'pdfminer':
---> 31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
     33             return self.extract_tesseract(filename, **kwargs)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
     46     def extract_pdfminer(self, filename, **kwargs):
     47         """Extract text from pdfs using pdfminer."""
---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
     49         return stdout
     50

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
     94         # pipe.wait() ends up hanging on large files. using
     95         # pipe.communicate appears to avoid this issue
---> 96         stdout, stderr = pipe.communicate()
     97
     98         # if pipe is busted, raise an error (unlike Fabric)

`UnboundLocalError: local variable 'pipe' referenced before assignment`

_Originally posted by @SatyaRamGV in https://github.com/deanmalmgren/textract/issue_comments#issuecomment-439043876_

olivx commented 5 years ago

I need to extract text from many PDFs and I have the same problem. Did you fix it? Which solution did you choose?

absingh2019 commented 5 years ago

I have the same problem. Do you have a solution for it?

karlrobertjanicki commented 5 years ago

Have you tried running it with sudo? That solved it for me.

jpweytjens commented 5 years ago

@SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

SatyaRamGV commented 5 years ago

> @SatyaRamGV can you try textract 1.6.2? I can't reproduce this issue on my end.

This error is with 1.6.1.

I think it is solved in 1.6.2, but v1.6.2 is not available as a PyPI package, so you would have to install it from the git repo.

jpweytjens commented 5 years ago

I'm closing this issue due to inactivity. If you still encounter the issue with the latest version of textract, feel free to leave a comment with additional information and I'll reopen the issue.

ewerkema commented 4 years ago

Same error in textract 1.6.3 on Linux from a Docker container. This error doesn't occur locally (on Windows). Maybe related to this issue on Stackoverflow.

2019-09-21T12:50:41.552392889Z Traceback (most recent call last):
2019-09-21T12:50:41.552428789Z   File "/app/src/processors/document2text.py", line 32, in process
2019-09-21T12:50:41.552441289Z     text = textract.process(document.path)
2019-09-21T12:50:41.552451789Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/__init__.py", line 77, in process
2019-09-21T12:50:41.552462289Z     return parser.process(filename, encoding, **kwargs)
2019-09-21T12:50:41.552472189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 46, in process
2019-09-21T12:50:41.552482389Z     byte_string = self.extract(filename, **kwargs)
2019-09-21T12:50:41.552492189Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 20, in extract
2019-09-21T12:50:41.552502389Z     return self.extract_pdftotext(filename, **kwargs)
2019-09-21T12:50:41.552512089Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py", line 43, in extract_pdftotext
2019-09-21T12:50:41.552522089Z     stdout, _ = self.run(args)
2019-09-21T12:50:41.552531989Z   File "/usr/local/lib/python3.6/site-packages/textract/parsers/utils.py", line 96, in run
2019-09-21T12:50:41.552549489Z     stdout, stderr = pipe.communicate()
2019-09-21T12:50:41.552950690Z UnboundLocalError: local variable 'pipe' referenced before assignment
jpweytjens commented 4 years ago

@ewerkema Thanks for the Stackoverflow link. I have no experience with Docker, but I did find this issue which might be related. Can you comment if this is the same issue?

Textract relies on the external command line tool pdftotext. Is this available in your Docker container? If it isn't available, textract catches the error and falls back on the Python module pdfminer to process the PDF file. I think Docker might be raising a different kind of error that we don't check for.
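A quick way to answer the "is it available?" question from inside the container is to look the tools up on PATH. This is only a sketch using the standard library: `missing_tools` is a hypothetical helper name, not part of textract, and the default tool names are the two commands mentioned in this thread (`pdftotext` and `pdf2txt.py`).

```python
import shutil


def missing_tools(tools=("pdftotext", "pdf2txt.py")):
    """Return the names from `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; textract does not provide it.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

A tool that is found on PATH can still fail at run time (for example for lack of memory), so an empty result only rules out the "not installed" case.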

ewerkema commented 4 years ago

@jpweytjens It was actually a memory problem in the Docker container. Because of insufficient memory, pdftotext failed, which caused the UnboundLocalError. Following the installation instructions for the system packages (using the apt-get package manager) and increasing the memory solved the issue for me.

ghost commented 4 years ago

I think I know where this comes from: this bit of code in ShellParser:

        # run a subprocess and put the stdout and stderr on the pipe object
        try:
            pipe = subprocess.Popen(
                args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
        except OSError as e:
            if e.errno == errno.ENOENT:
                # File not found.
                # This is equivalent to getting exitcode 127 from sh
                raise exceptions.ShellError(
                    ' '.join(args), 127, '', '',
                )

...coupled with forking issues on Unix: https://stackoverflow.com/questions/5306075/python-memory-allocation-error-using-subprocess-popen

Since the out-of-memory error is an OSError, it gets caught in the except block, but then eaten; the program tries to continue but since the assignment to pipe failed, it's not defined, hence the error message.

This could be alleviated by adding a bare raise after the errno check, at least to make it clearer what the actual error is. I could submit a PR if necessary?
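A minimal sketch of what that change could look like, written as a standalone function so it can be run outside textract. Assumptions: `ShellError` here is a simplified stand-in for `exceptions.ShellError`, `run` is a free function rather than the `ShellParser` method, and the non-zero exit code check is inferred from the "if pipe is busted, raise an error" comment in the traceback above. The only intended behavioural change is the bare `raise`.

```python
import errno
import subprocess


class ShellError(Exception):
    """Simplified stand-in for textract's exceptions.ShellError."""

    def __init__(self, command, exit_code, stdout, stderr):
        super().__init__(f"{command!r} failed with exit code {exit_code}")
        self.command = command
        self.exit_code = exit_code
        self.stdout = stdout
        self.stderr = stderr


def run(args):
    """Run `args` as a subprocess and return (stdout, stderr) as bytes."""
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Executable not found: equivalent to exit code 127 from sh.
            raise ShellError(' '.join(args), 127, '', '')
        # Any other OSError (e.g. ENOMEM when the fork fails) is re-raised
        # instead of being swallowed, so `pipe` is never read unassigned.
        raise

    # communicate() rather than wait(), as in the original code.
    stdout, stderr = pipe.communicate()

    # Assumed from the original comment: a non-zero exit code is an error.
    if pipe.returncode != 0:
        raise ShellError(' '.join(args), pipe.returncode, stdout, stderr)
    return stdout, stderr
```

With this shape, an out-of-memory failure would surface as the original `OSError` with its own errno and message, rather than as an `UnboundLocalError` a few lines later.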

VenkateshDharavath commented 3 years ago

@SatyaRamGV I tried textract==1.6.1, textract==1.6.2, and textract==1.6.3. All of these versions throw this error. I'm on Windows 10 and have enough memory to perform this task, but I still get the same error.

Traceback (most recent call last):
  File "", line 1, in
    text = textract.process(r"C:..\docs\Mortgage Security Agreement\Closed End PA MTG 5000.39.pdf", method='pdfminer')
  File "C:..\venv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment

nateGeorge commented 3 years ago

I had this problem when trying to read .doc files because I didn't have antiword properly installed. If you are on windows 10 and are trying to read .doc files, you need antiword from here: https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml

https://stackoverflow.com/a/51727238/4549682

PPGHPP commented 3 years ago

I have exactly the same error as VenkateshDharavath reported on Nov 5, 2020. I'm on Windows 10 and have enough memory and the latest installations.

traverseda commented 3 years ago

@PPGHPP I've made some changes that should make the actual error clearer, they're not deployed yet though. Can you try installing from master?

It should be a command like pip install git+https://github.com/deanmalmgren/textract.git, although I'm not sure how you installed it on windows.

PPGHPP commented 3 years ago

Hi, thank you for the information. I ran pip install as you asked. The only error was: "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts." It also reported: "Successfully installed pdfminer.six-20191110".

Now I'm able to use it like text=textract.process("tacl_a_00344.pdf"), and the result looks OK. Thanks again! BR PirkkoP


PPGHPP commented 3 years ago

Hi again, one thing I noticed. With textract, a sentence comes out like this: "We define improvement as the quantity\r\nmax{0, fa \xe2\x88\x92 fb }, where b is our current..." But the OCR-based pytesseract gives: "We define improvement as the quantity\r\nmax{0, fa — fy}. where b is our current ..." This is from p. 764 of the attachment. BR PirkkoP
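For what it's worth, the `\xe2\x88\x92` in the textract output is what a UTF-8 encoded character looks like when a `bytes` object is displayed without being decoded. The sketch below decodes the fragment quoted above, assuming the output is UTF-8 encoded, and names any non-ASCII character it finds.

```python
import unicodedata

# Bytes copied from the textract output quoted above.
raw = b"max{0, fa \xe2\x88\x92 fb }"

# Assumption: the bytes are UTF-8 encoded.
text = raw.decode("utf-8")

# List every non-ASCII character with its code point and Unicode name.
for ch in text:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# prints: U+2212 MINUS SIGN
```

So the three escaped bytes are a single minus sign, and calling `.decode("utf-8")` on the returned bytes should display it as such.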


traverseda commented 3 years ago

I think that probably has something to do with chardet. The next release should help.