Open drumadrian opened 6 years ago
I didn't see anything about converting to XML, but in pull request #8 I used the basic usage code from their programming docs.
More examples of how to use pdfminer http://denis.papathanasiou.org/archive/2010.08.04.post.pdf
Hey @smyleeface I should have been more specific.
They mentioned it in the README under "Command Line Tools":
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (e.g. images).
It was only if the example was used.
The examples link you posted looks great. 🥇
Ah ok. Those two files, pdf2txt.py and dumppdf.py, are command-line tools. I tried to get them to work and had the hardest time. I was having this issue, it doesn't work well with virtualenvs, and it would be something we'd use if we wrote it in Bash.
The documentation is lacking, and I'm surprised I found what I did.
I don't mind using an EC2 instance to get a Bash runtime. :-) I think we could even get away with running a Spot instance just for that and then stopping it. We might even be able to run Packer to do the work and then shut down. LoL, I always liked the idea of having an Auto Scaling Group filled with Spot instances pulling tasks off of an SQS queue. I've set that up before using a multi-threaded Python application I downloaded from GitHub.
Use this tool to convert PDFs to text/XML.
Then strip all of the XML tags away and you will have raw text.
https://github.com/pdfminer/pdfminer.six
https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text
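For the "strip all of the XML tags" step, here is a minimal sketch using only the Python standard library. It assumes the XML was already produced by the tool (for example with something like `pdf2txt.py -t xml`, which I believe pdfminer.six supports, but double-check the flag) and that the output is well-formed. The function name `strip_xml_tags` and the sample string are made up for illustration; they are not part of pdfminer.

```python
import xml.etree.ElementTree as ET


def strip_xml_tags(xml_string: str) -> str:
    """Return only the text content of an XML document, dropping all tags.

    Hypothetical helper, not part of pdfminer. Assumes well-formed XML;
    ET.fromstring raises ParseError otherwise.
    """
    root = ET.fromstring(xml_string)
    # itertext() walks the tree in document order and yields every text node,
    # so joining the pieces gives the raw text without any markup.
    return "".join(root.itertext())


if __name__ == "__main__":
    # Made-up sample that mimics one <text> element per character.
    sample = "<pages><page><text>H</text><text>i</text></page></pages>"
    print(strip_xml_tags(sample))
```

If the real output turns out not to be well-formed (the README does call dumppdf.py output "pseudo-XML"), this parser will fail and a more forgiving approach would be needed.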