andrewferrier / email2pdf

Script to convert emails to PDF from the command-line, as well as detach recognized attachments. Helps to process incoming emails and assist automatically with a non-paper paperwork workflow. Designed to work in tandem with getmail to convert forwarded emails to PDF automatically.

MIT License

Upside-down text test failing incorrectly on Linux #58

Open andrewferrier opened 9 years ago

andrewferrier commented 9 years ago

FAIL: test_plaincontent_upsidedown (tests.test_Subprocess_Basic.TestBasic)

Traceback (most recent call last):
  File "/home/ferriera/gitco/github/email2pdf/tests/test_Subprocess_Basic.py", line 74, in test_plaincontent_upsidedown
    self.assertRegex(self.getPDFText(self.getTimedFilename()), "ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH")
AssertionError: Regex didn't match: 'ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH' not found in 'ɯɐɹƃoɹd ɟpd lᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH\n\n\x0c'

andrewferrier commented 9 years ago

Doesn't seem to work if you use /usr/local/bin/pdf2txt.py either; it may just be that pdfminer3k is broken in this respect. Worth trying PyPDF2? http://stackoverflow.com/questions/15737806/extract-text-using-pdfminer-and-pypdf2-merges-columns (although PyPDF2 doesn't seem to extract any text at all).

andrewferrier commented 9 years ago

Fails on both portland and in Docker.

andrewferrier commented 9 years ago

Consider calling out to pdftotext (http://en.wikipedia.org/wiki/Pdftotext)? Is that supported on OS X?
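
A rough sketch of what calling out to pdftotext from the tests might look like, assuming the Poppler `pdftotext` binary is installed and on the PATH. The function name `pdf_to_text` and its `executable` parameter are hypothetical and not part of email2pdf; whether pdftotext actually extracts the upside-down '2' correctly is untested here.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path, executable="pdftotext"):
    """Return the text of pdf_path as extracted by the pdftotext CLI.

    Hypothetical helper: assumes Poppler's pdftotext is installed.
    """
    # Fail early with a clear error if the binary is not available,
    # e.g. on a machine where Poppler has not been installed.
    if shutil.which(executable) is None:
        raise FileNotFoundError("{} not found on PATH".format(executable))

    # "-" as the output file asks pdftotext to write to stdout;
    # "-enc UTF-8" requests UTF-8 output so non-ASCII text survives.
    result = subprocess.run(
        [executable, "-enc", "UTF-8", pdf_path, "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

A test helper such as getPDFText could then delegate to this function instead of pdfminer3k, with the early FileNotFoundError making it obvious when the dependency is missing.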

andrewferrier commented 9 years ago

Experimenting with this issue in branch issue-58:

https://github.com/andrewferrier/email2pdf/tree/issue-58

andrewferrier commented 8 years ago

Another option is ebook-convert from Calibre: http://askubuntu.com/a/56400/728

andrewferrier commented 8 years ago

The basic issue is that the upside-down '2', namely ᄅ, is not being extracted correctly and is being replaced with a space (there is also some extra whitespace at the end of the extracted string). Here's the char in question: http://unicodelookup.com/#ᄅ/1
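
To make the mismatch described above concrete, here is a small sketch that compares the expected and extracted strings from the failing assertion character by character and prints the Unicode code point and name of anything that differs. The helper names `describe` and `find_mismatches` are made up for illustration; the two strings are copied from the test output.

```python
import unicodedata

# Strings copied from the failing assertion in this issue.
expected = "ɯɐɹƃoɹd ɟpdᄅlᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH"
extracted = "ɯɐɹƃoɹd ɟpd lᴉɐɯǝ ǝɥʇ ɟo ʇsǝʇ ɐ sᴉ sᴉɥʇ ollǝH\n\n\x0c"


def describe(ch):
    """Return a code point plus Unicode name for a single character."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))


def find_mismatches(want, got):
    """Return (index, wanted_char, got_char) for each differing position."""
    return [(i, w, g) for i, (w, g) in enumerate(zip(want, got)) if w != g]


# Strip the trailing newlines and form feed before comparing, since the
# extra whitespace at the end is a separate, minor difference.
mismatches = find_mismatches(expected, extracted.rstrip())
for index, want, got in mismatches:
    print(index, describe(want), "->", describe(got))
```

Stripping trailing whitespace from the extracted text in the test itself would remove the second difference, but not the missing character.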

andrewferrier commented 8 years ago

The problem could be in the generation of the PDF rather than in reading it: http://stackoverflow.com/a/28694708/27641

andrewferrier commented 8 years ago

An option to try: use html-pdf to generate the PDF? https://www.npmjs.com/package/html-pdf