drumadrian / Polly-Whitepapers

This project is intended to download PDFs of the AWS whitepapers, convert them to audio, and publish the results.

Test PDF Miner #7

Open drumadrian opened 6 years ago

drumadrian commented 6 years ago

Use this tool to convert PDFs to text/XML.

Then strip all of the XML tags away and you will have raw text.
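The tag-stripping step could be sketched as below, using only the Python standard library. This is a minimal sketch, not the project's code: it assumes the XML is well-formed, and the `strip_xml_tags` helper and the sample document are made up for illustration. Real pdfminer output may need different handling (for example, `dumppdf.py` is described as emitting pseudo-XML, which a strict parser might reject).

```python
import xml.etree.ElementTree as ET


def strip_xml_tags(xml_text: str) -> str:
    """Return only the text content of a well-formed XML document."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order, yielding each element's
    # text and the tail text that follows its closing tag.
    pieces = (piece.strip() for piece in root.itertext())
    # Drop whitespace-only pieces and join the rest with single spaces.
    return " ".join(piece for piece in pieces if piece)


# Hypothetical sample input, not actual pdfminer output.
sample = "<pages><page id='1'><text>Hello</text> <text>world</text></page></pages>"
print(strip_xml_tags(sample))
```

Joining with spaces suits word-level elements like the sample; output that wraps each character in its own element would need the pieces concatenated without a separator instead.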

https://github.com/pdfminer/pdfminer.six

https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text

smyleeface commented 6 years ago

I didn't see anything about converting to XML, but in pull request #8 I used the basic usage code from their programming docs.

More examples of how to use pdfminer: http://denis.papathanasiou.org/archive/2010.08.04.post.pdf

drumadrian commented 6 years ago

Hey @smyleeface I should have been more specific.

They mention it in the README under "Command Line Tools":

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (e.g. images).

That only applies if that example was used.

The examples link you posted looks great. 🥇

smyleeface commented 6 years ago

Ah, OK. Those two files, pdf2txt.py and dumppdf.py, are command-line tools. I tried to get them to work and had the hardest time. I was having this issue: they don't work well with virtualenvs. They would be something we'd use if we wrote this in Bash.

The documentation is lacking, and I'm surprised I found what I did.

drumadrian commented 6 years ago

I don't mind using an EC2 instance to get a Bash runtime. :-) I think we could even get away with running a Spot Instance just for that and then stopping it. We might even be able to run Packer to do the work and then shut down. LOL. I always liked the idea of having an Auto Scaling group filled with Spot Instances pulling tasks off of an SQS queue. I've set that up before using a multi-threaded Python application I downloaded from GitHub.
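The worker pattern described here (several threads pulling tasks off a queue until it is empty) could look roughly like the sketch below. It is not the application mentioned above. To keep it runnable without AWS credentials, a local `queue.Queue` stands in for the SQS queue, and the `run_workers` helper, the file names, and the handler are all hypothetical. A real version would presumably poll SQS with boto3's `receive_message` and call `delete_message` after each task succeeds.

```python
import queue
import threading


def run_workers(tasks, handler, num_workers=4):
    """Process every task with a pool of threads and return the results."""
    task_queue = queue.Queue()  # local stand-in for an SQS queue (assumption)
    for task in tasks:
        task_queue.put(task)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                task = task_queue.get_nowait()
            except queue.Empty:
                return  # queue drained; a Spot worker could shut down here
            result = handler(task)
            # Guard the shared list so concurrent appends stay consistent.
            with results_lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Hypothetical tasks: pretend each PDF name is converted to a text file name.
done = run_workers(
    ["a.pdf", "b.pdf", "c.pdf"],
    handler=lambda name: name.replace(".pdf", ".txt"),
)
print(sorted(done))
```

Results arrive in whatever order the threads finish, which is why the example sorts them before printing.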