deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

The filename extension .pdf is not yet supported by textract #156

Closed MJK88 closed 7 years ago

MJK88 commented 7 years ago

I compiled a Python script to an executable in which I want to read a PDF with textract. In my Python editor this works without problems, but when I execute the .exe file I get the warning: "The filename extension .pdf is not yet supported by textract". Obviously, that's not true. Does anybody know how I can fix this?

DeastinY commented 7 years ago

I can't help, but I'm currently in roughly the same boat:

textract.exceptions.ExtensionNotSupported: The filename extension .pdf is not yet supported by
textract. Please suggest this filename extension here:

    https://github.com/deanmalmgren/textract/issues

Failed to execute script gui
QWaitCondition: Destroyed while threads are still waiting

I am building a Windows .exe and use pdftotext.exe, which works fine when running the Python script, but I get the strange error above when running the executable created by pyinstaller.

deanmalmgren commented 7 years ago

I'm sorry to hear you're having trouble extracting text from PDFs. Can you tell me a bit more about your setup (OS, textract version, etc)? It would also be helpful to see the actual command you are trying to run so I can try to reproduce.

Since you mentioned ".exe", I'm guessing you're on a Windows system. #155 looks like it is grappling with similar issues. My hunch is that your PATH (in unix, this variable controls where it looks for executables but I don't know the Windows equivalent) is different from within the python interpreter vs when you're using textract on the command line.

I have never actually installed textract on a Windows system—please feel free to add documentation about how you got it installed (see #111)!

DeastinY commented 7 years ago

Hey @deanmalmgren, thanks for the super fast reply! I'm just confused by the "The filename extension .pdf is not yet supported by textract" message. Is this the output if textract is not set up properly? I added the required stuff (pdftotext) to my PATH, but nothing changed. If I get everything set up properly I will surely add some documentation. This is a great project!

deanmalmgren commented 7 years ago

That is the error message that is reported when an ExtensionNotSupported error is raised. I believe that error is only thrown here, when it tries to import the parser module for the filename extension.

Can you share more about your system configuration and the verbatim code (or at least as close to it as possible) so I can help you diagnose?

DeastinY commented 7 years ago

What I tried so far:

I'm using textract 1.5.0, Python 3.6.1, and the dev branch of pyinstaller. The source code I'm working on is here, but it's currently still in pretty bad shape. Basically all I'm doing is textract.process(str(file_pdf)).decode('utf-8')

deanmalmgren commented 7 years ago

hmmmm... this is really odd. I take it back; this has nothing to do with the PATH (or at least it doesn't appear that this is the problem). The error is thrown before it's even getting to the PDF parser.

Can you try manipulating the textract/parsers/__init__.py file in your local version?

It appears that this line isn't working properly with your setup for some reason. It should be doing something like importlib.import_module('.pdf_parser', 'textract.parsers') but that appears to be throwing an ImportError. 🤞
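One way to check this outside of textract is to attempt the same kind of relative import by hand and print the resulting error. The sketch below is only a diagnostic aid: the helper name check_relative_import is made up for illustration, and it assumes the parser module is named '.pdf_parser' inside 'textract.parsers', as described above.

```python
import importlib


def check_relative_import(rel_module, package):
    """Attempt a relative import and report whether it succeeded.

    Returns (True, module_name) on success, or (False, error_repr) when
    an ImportError is raised. The helper name is hypothetical.
    """
    try:
        module = importlib.import_module(rel_module, package)
    except ImportError as exc:
        return False, repr(exc)
    return True, module.__name__


# Assumed module and package names, taken from the comment above.
ok, detail = check_relative_import('.pdf_parser', 'textract.parsers')
print(ok, detail)
```

Running this both from the plain Python script and from inside the built executable should show whether the import only fails in the frozen build.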

DeastinY commented 7 years ago

Thanks a bunch for clarifying - was looking through the code but had very little idea of what it was actually doing 😄 I am not exactly sure how to manipulate those but I'll try out some things and report back.

I realized my file looks a lot different from the one in the repo. I updated (manually for now) and will see if that helps.

So I checked the module it's trying to import and it is actually .pdf_parser, so that seems to work.

Solution to the initial problem: it actually was an issue with my understanding of pyinstaller. By default, it does not include textract's parsers. To add these to the final package, add the path of the parsers to pathex in the .spec file, e.g. pathex=['..\\GitHub\\srpdfcrawler', 'C:\\Users\\ric\\AppData\\Local\\Programs\\Python\\Python36\\Lib\\site-packages\\textract\\parsers'], and add hiddenimports=["textract.parsers.pdf_parser"] for the hidden import.
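For readers who want to see the two settings side by side, here is a minimal sketch that builds them as plain Python values. The function name parser_bundle_settings, the placeholder paths, and the script name 'gui.py' are all hypothetical; only pathex and hiddenimports come from the comment above.

```python
import os


def parser_bundle_settings(project_dir, site_packages_dir):
    """Build the pathex and hiddenimports values described above.

    Both arguments are placeholders for your own locations.
    """
    parsers_dir = os.path.join(site_packages_dir, 'textract', 'parsers')
    return {
        # Extra search paths for PyInstaller's analysis step.
        'pathex': [project_dir, parsers_dir],
        # Modules that are imported dynamically and so are not found automatically.
        'hiddenimports': ['textract.parsers.pdf_parser'],
    }


# Hypothetical locations; substitute the real project and site-packages paths.
settings = parser_bundle_settings('path/to/project', 'path/to/site-packages')
print(settings)

# Inside a .spec file these could be passed along as, for example:
#   a = Analysis(['gui.py'], **settings)   # 'gui.py' is a hypothetical script name
```

The hidden import is needed because textract picks its parser module by name at run time, so a static scan of the imports does not see it.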

DeastinY commented 7 years ago

Okay, so here is what I did to use textract to extract text from PDFs using pdftotext on Windows with pyinstaller:

I only have very basic experience using pyinstaller, so this might be a wrong/bad way to do it, but it works.

deanmalmgren commented 7 years ago

@DeastinY I don't know what happened when I uploaded textract to pypi, but I just realized that the v1.6.0 release never made it onto pypi for some reason. I just pushed v1.6.0 to pypi now. This is probably why your version of textract looked so different. My apologies for the problem there.

There are several fixes in 1.6.0, including #136 which improved the parser imports. I'm cautiously optimistic that this will help with some things.

I'm terribly sorry for the inconvenience. I thought I had released 1.6.0 but apparently it didn't fully upload to pypi or something :(

DeastinY commented 7 years ago

> I'm terribly sorry for the inconvenience.

I'm super happy I can rely on such a great Open Source tool for my project. Your instant support and help with debugging was super helpful and I'm very happy I could at least help identify a failed push for 1.6.0 :D Once again, thank you so much for the effort you put into this great project and the provided support! 💓

MJK88 commented 7 years ago

To follow up on your discussion of this problem: I used a similar setup. I use Python 2.7 and the compiler Py2exe, which works similarly to pyinstaller. I installed textract on Windows with the command line 'pip install textract'; back then it was version 1.5. After your messages I tried to include 'textract.parsers.pdf_parser', as in the spec file of pyinstaller, but in the options of Py2exe: 'includes': ['textract.parsers.pdf_parser']. This didn't fix it. Then I updated textract to version 1.6.1 with the command line 'pip install textract --upgrade'. This fixed it for me. The compiled executable doesn't give the error anymore. I haven't yet tested whether the module inclusion is still needed. Thanks for the quick support and discussion!

deanmalmgren commented 7 years ago

Great; I'm glad we have this resolved. Thanks for the discussion everyone.

lrq3000 commented 6 years ago

Confirmed, to make textract work with pyinstaller, one has to both:

Nothing else is needed! Thanks a lot for the great job!

lrq3000 commented 6 years ago

Update: adding 'textract.parsers.pdf_parser' to hiddenimports will only allow for support of pdf files, but not any other type normally supported by Textract!

You can guess why: we need to add ALL parsers as hidden imports! Here is how I did it:

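import os
import textract

# Collect every parser module shipped with textract so that all of them get bundled.
textract_parsers_dir = os.path.join(os.path.dirname(textract.__file__), 'parsers')
textract_all_parsers = [
    name for name in os.listdir(textract_parsers_dir)
    if name.endswith('.py') and name != '__init__.py'  # skip the package file and non-source files
]
textract_all_parsers_imports = ['textract.parsers.' + os.path.splitext(parser)[0] for parser in textract_all_parsers]

a = Analysis([...,
hiddenimports=textract_all_parsers_imports,
...
])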

An alternative way would be to use textract.parsers._get_available_extensions() but unfortunately it is currently raising an exception (at least for my platform), see #187.
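Another option that avoids walking the directory by hand is to let the standard library enumerate the package's submodules. This is only a sketch: list_submodules is a hypothetical helper name, and the example runs against the stdlib json package so that it works without textract installed.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """Return the dotted names of the direct submodules of a package."""
    package = importlib.import_module(package_name)
    return sorted(
        package_name + '.' + name
        for _finder, name, _is_pkg in pkgutil.iter_modules(package.__path__)
    )


# Demonstrated on a stdlib package; in a .spec file the call would instead be
# list_submodules('textract.parsers'), with the result passed as hiddenimports.
print(list_submodules('json'))
```

PyInstaller also ships a collect_submodules helper in PyInstaller.utils.hooks that serves the same purpose, which may be preferable when the code lives in a spec file anyway.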

BTW I implemented a GUI for textract: easytextract. This is why I wanted to build an executable :-) Thank you very much for all your hard work, the library is amazing!