Open andrewferrier opened 9 years ago
This doesn't seem to work with /usr/local/bin/pdf2txt.py either, so pdfminer3k itself may be broken in this respect. Worth trying PyPDF2? http://stackoverflow.com/questions/15737806/extract-text-using-pdfminer-and-pypdf2-merges-columns (although PyPDF2 doesn't seem to extract any text at all).
Fails on both portland and in Docker.
Consider calling out to http://en.wikipedia.org/wiki/Pdftotext ? Is that supported on OS X?
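A minimal sketch of what calling out to pdftotext from the tests might look like, assuming the binary (shipped with Poppler and Xpdf) is on the PATH. The names `build_pdftotext_command` and `get_pdf_text_external` are hypothetical and not part of email2pdf; `-enc UTF-8` and a trailing `-` (write to stdout) are standard pdftotext arguments.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list for extracting text to stdout (hypothetical helper)."""
    # "-enc UTF-8" requests UTF-8 output; the trailing "-" sends the text to stdout.
    return [executable, "-enc", "UTF-8", pdf_path, "-"]


def get_pdf_text_external(pdf_path):
    """Return the text of pdf_path as extracted by the external pdftotext tool."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8")
```

Whether this extracts the problem character any better than pdfminer3k is untested here; it would only help if the fault is on the reading side.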
Experimenting with this issue in branch issue-58.
Another option is ebook-convert from Calibre: http://askubuntu.com/a/56400/728
The basic issue is that the upside-down '2', namely ᄅ, is not being extracted correctly and is being replaced with a space (there is also some extra whitespace at the end of the extracted string). Here's the char in question: http://unicodelookup.com/#ᄅ/1
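To pin down exactly which character goes missing, a small comparison helper can report the first mismatching position along with the code points involved. This is only a sketch: `describe` and `first_difference` are hypothetical names, and the two strings are copied from the failing assertion in this issue.

```python
import unicodedata


def describe(ch):
    """Return a readable label for a single character, e.g. its code point and name."""
    if not ch:
        return "<end of string>"
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))


def first_difference(expected, actual):
    """Return (index, expected_char, actual_char) at the first mismatch, or None."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i, e, a
    if len(expected) != len(actual):
        # One string is a prefix of the other; report where the shorter one ends.
        i = min(len(expected), len(actual))
        return i, expected[i:i + 1], actual[i:i + 1]
    return None


expected = "ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH"
actual = "ɯɐɹƃoɹd ɟpd lᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH\n\n\x0c"

diff = first_difference(expected, actual)
if diff is not None:
    index, want, got = diff
    print("index {}: expected {}, got {}".format(index, describe(want), describe(got)))
```

Running the same comparison against the output of another extractor would show whether the substitution is specific to pdfminer3k.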
The problem could lie in generating the PDF rather than in reading it: http://stackoverflow.com/a/28694708/27641
Option to try: use html-pdf to generate the PDF? https://www.npmjs.com/package/html-pdf
FAIL: test_plaincontent_upsidedown (tests.test_Subprocess_Basic.TestBasic)
Traceback (most recent call last):
  File "/home/ferriera/gitco/github/email2pdf/tests/test_Subprocess_Basic.py", line 74, in test_plaincontent_upsidedown
    self.assertRegex(self.getPDFText(self.getTimedFilename()), "ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH")
AssertionError: Regex didn't match: 'ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH' not found in 'ɯɐɹƃoɹd ɟpd lᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH\n\n\x0c'