JoshData / pdf-diff

A PDF comparison utility in Python.
Creative Commons Zero v1.0 Universal

lxml error #6

Open pimpampoum opened 8 years ago

pimpampoum commented 8 years ago

Hello,

I got this error using your pdf-diff.py. Any idea?

$ ../VISA_III/pdf-diff.py ../VISA_III/visa_iii.pdf ../VISA_III/visa_iii_old.pdf > diff_visa_iii_iv.png
Traceback (most recent call last):
  File "../VISA_III/pdf-diff.py", line 456, in <module>
    changes = compute_changes(left_file, right_file, top_margin=top_margin)
  File "../VISA_III/pdf-diff.py", line 9, in compute_changes
    docs = [serialize_pdf(0, pdf_fn_1, top_margin), serialize_pdf(1, pdf_fn_2, top_margin)]
  File "../VISA_III/pdf-diff.py", line 24, in serialize_pdf
    for run in box_generator:
  File "../VISA_III/pdf-diff.py", line 84, in mark_eol_hyphens
    for next_box in boxes:
  File "../VISA_III/pdf-diff.py", line 57, in pdf_to_bboxes
    dom = etree.fromstring(xml)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:82934)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:124533)
  File "src/lxml/parser.pxi", line 1707, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:123074)
  File "src/lxml/parser.pxi", line 1079, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:117114)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:110510)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:112276)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:111124)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 7, line 10432, column 81

JoshData commented 8 years ago

Looks like a character encoding issue, either because pdftotext and subprocess.check_output aren't using the same encoding or etree.fromstring isn't quite the right way to load XML.

pimpampoum commented 8 years ago

Thanks. Well, I'm afraid you're right. There are plenty of math formulas that pdftotext can't deal with.

JoshData commented 8 years ago

Ideally this module wouldn't crash in those cases, so something probably can be fixed (although I don't have time to try myself).

neiljp commented 7 years ago

FYI, I ran into a similar issue during the 2017 Mozilla Global Sprint, where I used this library, and I have a potential patch/PR to fix it. Would you be interested in that?

neiljp commented 7 years ago

See PR #12 above, which should resolve this.
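
For readers hitting the same traceback: "Char value 7" is the BEL control character, which XML 1.0 does not allow anywhere in a document, so the parser rejects the pdftotext output outright. The sketch below shows one possible workaround, stripping the forbidden control characters before parsing. It is not necessarily what PR #12 does. The helper names `strip_invalid_xml_chars` and `parse_xml_leniently` are hypothetical, the input is assumed to be already decoded to `str`, and the standard-library `xml.etree.ElementTree` stands in for lxml so the example runs without extra dependencies.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 forbids (everything below 0x20 except
# tab, line feed and carriage return), plus the non-characters U+FFFE/U+FFFF.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 disallows removed."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_xml_leniently(xml_text: str) -> ET.Element:
    """Parse XML after dropping forbidden control characters (hypothetical helper)."""
    return ET.fromstring(strip_invalid_xml_chars(xml_text))


# Made-up stand-in for extractor output containing a stray BEL (0x07).
sample = "<doc><word>alpha\x07beta</word></doc>"
root = parse_xml_leniently(sample)
print(root.find("word").text)
```

Dropping the characters silently changes the extracted text, which seems acceptable for a visual diff but is a design choice. Replacing them with a placeholder such as U+FFFD would preserve character positions instead.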