Closed reginafcompton closed 6 years ago
For posterity: piping in2csv output to txt could work for Excel files!
PyPDF2 is easy to install and use. However, it comes with a couple of drawbacks:
(1) I noticed several instances where the extracted plain text omitted spaces. This seems to be a known issue with PyPDF2: https://github.com/mstamy2/PyPDF2/issues/17
(2) PyPDF2 can only convert one page at a time (using extractText()).
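A minimal sketch of the page-by-page loop that drawback (2) implies. It assumes the legacy PyPDF2 interface (PdfFileReader, numPages, getPage, extractText) that was current when this thread was written; later releases rename these. The helper names extract_all_pages and extract_pdf_text are hypothetical, not part of this PR.

```python
def extract_all_pages(reader):
    """Join the text of every page of a PyPDF2-style reader.

    `reader` is assumed to expose the legacy PyPDF2 interface:
    a `numPages` attribute and `getPage(i).extractText()`.
    """
    pages = []
    for index in range(reader.numPages):
        # extractText() handles a single page, so visit each page in turn.
        pages.append(reader.getPage(index).extractText())
    return "\n".join(pages)


def extract_pdf_text(path):
    """Open a PDF on disk and return its text (assumes legacy PyPDF2)."""
    from PyPDF2 import PdfFileReader  # third-party; imported lazily

    with open(path, "rb") as handle:
        return extract_all_pages(PdfFileReader(handle))
```

The missing-spaces problem from (1) would still apply to whatever extractText() returns for each page.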
I think pdfminer.six might be the best option - it comes with a nice pdf2txt.py script, although the pip install does not work as expected. The documentation suggests downloading it from source.
I ultimately landed on textract, since it converts pdf, doc, and docx files to plaintext, without the heft of unoconv or a second library. I did not have much difficulty installing it on macOS, but I'd like to try it out on the Councilmatic server before confirming this solution: http://textract.readthedocs.io/en/stable/installation.html#ubuntu-debian
Clarification regarding the abandonment of unoconv
unoconv struggles with converting PDF to txt. I tried these conversions both locally and on the Councilmatic server. For both, unoconv errors with "Unable to store document..." when calling storeToURL - a function in OpenOffice.
Server
# Command run
# I also tried this with "text"
unoconv -f txt 8e6281f1-8342-42ae-b5a2-271ca6902d99.pdf
# Error
File "/usr/bin/unoconv", line 1118, in convert
document.storeToURL(outputurl, tuple(outputprops) )
uno.IOException: SfxBaseModel::impl_store <file:///tmp/8e6281f1-8342-42ae-b5a2-271ca6902d99.txt> failed: 0xc10
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/unoconv", line 1389, in <module>
main()
File "/usr/bin/unoconv", line 1305, in main
convertor.convert(inputfn)
File "/usr/bin/unoconv", line 1120, in convert
raise UnoException("Unable to store document to %s (ErrCode %d)\n\nProperties: %s" % (outputurl, e.ErrCode, outputprops), None)
File "/usr/lib/python3/dist-packages/uno.py", line 507, in _uno_struct__getattr__
return getattr(self.__dict__["value"], name)
AttributeError: ErrCode
It may be that we need a different version of OpenOffice, but I would rather not go down that path, considering that we are happily using Unoconv for the RTF converter.
Installing textract on an Ubuntu server
I installed textract on the staging server (i.e., the server with all our staging sites, not the Councilmatic server), and it worked well. It requires several lightweight dependencies and one small install hack.
First, install all the dependencies.
Second, textract fails when installing pocketsphinx, which is not actually necessary for converting PDFs or Word documents to txt. This issue provides a clever workaround: https://github.com/deanmalmgren/textract/pull/178.
Third, run pip install textract.
@hancush - I have a working solution for the text conversion!
The one point that requires further thought is the use of a NamedTemporaryFile in convert_document.
Ideally, we could do this without a temporary file, i.e., with a subprocess, since textract can be used as a CLI tool:
import os
import subprocess
p = subprocess.Popen(['textract', '--stdin', '--stdout'], preexec_fn=os.setsid, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)
plain_text, stderr_data = p.communicate(input=RESPONSE_IN_SOME_FORMAT, timeout=15)
However, textract uses a variety of dependencies to open and convert files: docx, doc, pdf.
Is there a way to pass pdf, doc, and docx files to textract through a subprocess that I am not seeing?
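For reference, a minimal sketch of the temporary-file approach under discussion. The signature of convert_document here is hypothetical (the real function in this PR may differ), as are the extractor parameter and the _textract_extractor helper. It assumes textract.process(path) returns bytes and, by default, chooses its parser from the file extension, which is why the suffix is set on the temp file.

```python
import os
import tempfile


def _textract_extractor(path):
    """Default extractor: hand the file path to textract."""
    import textract  # third-party; imported lazily

    return textract.process(path)


def convert_document(content, extension, extractor=_textract_extractor):
    """Write `content` (bytes) to a temp file named with `extension`,
    run the extractor on it, and return the decoded text."""
    # delete=False so the file can be reopened by name on any platform;
    # it is removed explicitly once extraction finishes.
    with tempfile.NamedTemporaryFile(suffix=extension, delete=False) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    try:
        raw = extractor(tmp_path)
    finally:
        os.remove(tmp_path)
    return raw.decode("utf-8", errors="replace")
```

A call might look like convert_document(response.content, ".pdf"), where response is assumed to be a requests response for the attachment URL.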
@fgregg - could I bring you into this conversation as a consultant/reviewer? The main parts of this are (1) why I used textract, rather than unoconv (see comments above), and (2) the question of using a subprocess with --stdin rather than a TempFile (see notes directly above).
Since unoconv remains a dependency, as we use it for converting the RTF files to HTML, what about just installing pdftotext and using that for PDFs, and unoconv for the doc files?
That would seem to be a much smaller footprint?
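A rough sketch of what that split could look like, assuming pdftotext accepts "-" as the output file to write to stdout and that unoconv's --stdout flag behaves the same way for doc and docx input. The function names build_command and convert_to_text are hypothetical, and the 15 second timeout simply mirrors the snippet above.

```python
import os
import subprocess


def build_command(path):
    """Pick a converter command from the file extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".pdf":
        # "-" asks pdftotext to send the text to stdout.
        return ["pdftotext", path, "-"]
    if extension in (".doc", ".docx"):
        return ["unoconv", "-f", "txt", "--stdout", path]
    raise ValueError("unsupported file type: %s" % extension)


def convert_to_text(path, timeout=15):
    """Run the chosen converter and return its stdout as text."""
    result = subprocess.run(
        build_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Both tools work from a file on disk here, so this version would still need the attachment saved locally first.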
@fgregg, yes, indeed. One of my first implementations did just that (i.e., use unoconv for doc and docx, then another tool for PDF) - see commented code. I am happy to revert to something along these lines with pdftotext.
With that said...
I am not married to either solution! Just getting my thoughts out there.
Ideally, we are going to isolate each Councilmatic instance on its own server: do we want to have unoconv on the Metro server for this little task? In that case, it seems like textract would be a smaller footprint, no?
This is a compelling argument, as unoconv is a big dependency. However, textract has a lot of binary dependencies; it is very far from lightweight.
Okay, I'm okay with textract.
This PR handles Metro issue https://github.com/datamade/la-metro-councilmatic/issues/266
The script should (1) convert .doc, .docx, and PDF attachments into plain text, and (2) save that text in the full_text field on the BillDocument model.
Unoconv can convert .doc and .docx files. However, it cannot convert PDFs into plain text. See what LibreOffice can and cannot export.
For PDFs, we could use something like PyPDF2 in combination with the requests library.
Note: Since we need to use something in addition to unoconv, we might consider whether we should use something other than unoconv for the doc and docx conversions (unoconv is a heavy dependency). The original scope mentions Excel, however, and unoconv could be good for those files.