datamade / django-councilmatic

:heartpulse: Django app providing core functions for *.councilmatic.org
http://councilmatic.org
MIT License

Conversion script for LA Metro attachments #193

Closed reginafcompton closed 6 years ago

reginafcompton commented 6 years ago

This PR handles Metro issue https://github.com/datamade/la-metro-councilmatic/issues/266

The script should (1) convert .doc, .docx, and PDF attachments into plain text, and (2) save that text in the full_text field on the BillDocument model.

Unoconv can convert .doc and .docx files. However, it cannot convert PDFs into plain text. See what LibreOffice can and cannot export.

For PDFs, we could use something like PyPDF2 in combination with the requests library.

Note: Since we need something in addition to unoconv anyway, we should consider whether to use something other than unoconv for the .doc and .docx conversions as well (unoconv is a heavy dependency). The original scope mentions Excel, however, and unoconv could be good for those files.

reginafcompton commented 6 years ago

For PDF conversion, try this: https://www.binpress.com/tutorial/pdfrw-the-other-python-PDF-library/171 https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#4-standard-toolkit

hancush commented 6 years ago

for posterity: piping in2csv output to txt could work for excel files!

reginafcompton commented 6 years ago

PyPDF2 is easy to install and use. However, it comes with a couple drawbacks:

(1) I noticed several instances where the plain text omitted spaces. This seems to be a known issue with PyPDF2: https://github.com/mstamy2/PyPDF2/issues/17

(2) PyPDF2 can only convert one page at a time (using extractText()).
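
To illustrate drawback (2): getting the full text of a document means looping over its pages and joining the results. This is a minimal sketch, assuming the PyPDF2 1.x reader interface (numPages, getPage, extractText); the helper name extract_all_text is hypothetical and not part of this PR.

```python
def extract_all_text(reader) -> str:
    """Join the text of every page of a PyPDF2 1.x-style reader.

    `reader` is assumed to expose `numPages` and `getPage(n)`, and each
    page is assumed to expose `extractText()`, as PdfFileReader does.
    """
    pages = []
    for number in range(reader.numPages):
        # extractText() works on a single page, so collect page by page.
        pages.append(reader.getPage(number).extractText())
    return '\n'.join(pages)


# Hypothetical usage (requires PyPDF2 1.x and a local file):
#
#     import PyPDF2
#     with open('attachment.pdf', 'rb') as f:
#         text = extract_all_text(PyPDF2.PdfFileReader(f))
```

The missing-spaces problem from (1) would still apply to each page's text.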


I think pdfminer.six might be the best option - it comes with a nice pdf2txt.py script, although the pip install does not work as expected. The documentation suggests downloading it from source.

reginafcompton commented 6 years ago

I ultimately landed on textract, since it converts pdf, doc, and docx files to plaintext, without the heft of unoconv or a second library. I did not have much difficulty installing it on macOS, but I'd like to try it out on the Councilmatic server before confirming this solution: http://textract.readthedocs.io/en/stable/installation.html#ubuntu-debian

reginafcompton commented 6 years ago

Clarification regarding the abandonment of unoconv

unoconv struggles with converting PDF to txt. I tried these conversions both locally and on the Councilmatic server. For both, unoconv errors with "Unable to store document..." when calling storeToURL - a function in OpenOffice.

Server

# Command run 
# I also tried this with "text"
unoconv -f txt 8e6281f1-8342-42ae-b5a2-271ca6902d99.pdf

# Error
File "/usr/bin/unoconv", line 1118, in convert
    document.storeToURL(outputurl, tuple(outputprops) )
uno.IOException: SfxBaseModel::impl_store <file:///tmp/8e6281f1-8342-42ae-b5a2-271ca6902d99.txt> failed: 0xc10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/unoconv", line 1389, in <module>
    main()
  File "/usr/bin/unoconv", line 1305, in main
    convertor.convert(inputfn)
  File "/usr/bin/unoconv", line 1120, in convert
    raise UnoException("Unable to store document to %s (ErrCode %d)\n\nProperties: %s" % (outputurl, e.ErrCode, outputprops), None)
  File "/usr/lib/python3/dist-packages/uno.py", line 507, in _uno_struct__getattr__
    return getattr(self.__dict__["value"], name)
AttributeError: ErrCode

It may be that we need a different version of OpenOffice, but I would rather not go down that path, considering that we are happily using Unoconv for the RTF converter.

reginafcompton commented 6 years ago

Installing textract on an Ubuntu server

I installed textract on the staging server (i.e., the server with all our staging sites, not the Councilmatic server), and it worked well. It requires several lightweight dependencies and one small install hack.

First, install all the dependencies.

Second, textract fails when installing pocketsphinx, which is not actually necessary for converting PDFs or Word documents to txt. This issue provides a clever workaround: https://github.com/deanmalmgren/textract/pull/178.

Third, run pip install textract.

reginafcompton commented 6 years ago

@hancush - I have a working solution for the text conversion!

The one point that requires further thought is the use of a NamedTemporaryFile in convert_document.

Ideally, we could do this without a temporary file, i.e., with a subprocess, since textract can be used as a CLI tool:

p = subprocess.Popen(['textract', '--stdin', '--stdout'], preexec_fn=os.setsid, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)
plain_text, stderr_data = p.communicate(input=RESPONSE_IN_SOME_FORMAT, timeout=15)

However, textract uses a variety of dependencies to open and convert files: docx, doc, pdf.

Is there a way to pass pdf, doc, and docx files to textract through a subprocess that I am not seeing?
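
For comparison, here is a minimal sketch of the temporary-file approach, not the actual convert_document code in this PR. The helper name convert_via_tempfile and its parameters are hypothetical; the converter is passed in as a callable, on the assumption that textract.process accepts a file path and returns bytes.

```python
import os
import tempfile
from typing import Callable


def convert_via_tempfile(content: bytes, suffix: str,
                         convert: Callable[[str], bytes]) -> str:
    """Write `content` to a named temporary file and convert it to text.

    `convert` is any callable that takes a file path and returns bytes;
    textract.process is assumed to fit that shape. `suffix` (for example
    '.pdf') matters for converters that pick a parser from the extension.
    """
    # delete=False keeps the file on disk after the `with` block closes it,
    # so the converter can open it by name; it is removed in `finally`.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
    try:
        return convert(tmp.name).decode('utf-8', errors='replace')
    finally:
        os.unlink(tmp.name)
```

A call might look like convert_via_tempfile(response.content, '.pdf', textract.process), where response is a requests response for the attachment URL.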


@fgregg - could I bring you into this conversation as a consultant/reviewer? The main parts of this are (1) why I used textract, rather than unoconv (see comments above), and (2) the question of using a subprocess with --stdin rather than a TempFile (see notes directly above).

fgregg commented 6 years ago

Since unoconv remains a dependency, as we use it for converting the rtf files to html, what about just installing pdftotext and using that for pdfs, and unoconv for the doc files?

That would seem to be a much smaller footprint?
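
A rough sketch of that split, choosing the tool by file extension. The function names are hypothetical, and the exact command-line flags are assumptions to check against the installed versions: "pdftotext FILE -" is assumed to write text to stdout, and unoconv is assumed to support -f txt together with --stdout.

```python
import os
import subprocess
from typing import List


def text_command(path: str) -> List[str]:
    """Return the command that converts `path` to plain text on stdout."""
    extension = os.path.splitext(path)[1].lower()
    if extension == '.pdf':
        # Assumption: a trailing "-" makes pdftotext write to stdout.
        return ['pdftotext', path, '-']
    if extension in ('.doc', '.docx'):
        # Assumption: unoconv accepts "-f txt" and "--stdout".
        return ['unoconv', '-f', 'txt', '--stdout', path]
    raise ValueError('No converter configured for {!r}'.format(extension))


def to_plain_text(path: str, timeout: int = 15) -> str:
    """Run the matching converter and return its output as text."""
    result = subprocess.run(
        text_command(path),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return result.stdout.decode('utf-8', errors='replace')
```

Both tools take a file path here, so the attachment would still need to be on disk first.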

reginafcompton commented 6 years ago

@fgregg, yes, indeed. One of my first implementations did just that (i.e., use Unoconv for doc and docx, then another tool for PDF) - see commented code. I am happy to revert to something along these lines with pdftotext.

With that said...

I am not married to either solution! Just getting my thoughts out there.

fgregg commented 6 years ago

> ideally, we are going to isolate each Councilmatic instance on unique servers: do we want to have unoconv on the Metro server for this little task? in that case, it seems like textract would be a smaller footprint, no?

This is a compelling argument, as unoconv is a big dependency. However, textract has a lot of binary dependencies; it is very far from lightweight.

Okay, I'm okay with textract.