euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

pdf2txt.py parsing text out of order #121

Open jrussell999 opened 8 years ago

jrussell999 commented 8 years ago

I have used pdf2txt.py to create both a .txt and a .html file of this PDF, and both have the same problem: some lines of text appear out of order. http://www.filehosting.org/file/details/510581/american_samoa_copy_for_experimenting.pdf

For example, this is the correct layout of one section of page one of the pdf:

Documents Required

Copy of Passport (some ports require Passports for all family members listed on the 3299) Form CF-3299 Supplemental Declaration (required by most ports) Detailed inventory in English Copy of Visa (if non-US citizen / permanent resident) / copy of Permanent Resident Card I-94 Stamp / Card Copy of Bill of Lading (OBL) / Air Waybill (AWB) Form DS-1504 (Diplomats) A-1 Visa (Diplomats) Importers Security Filing (ISF)

And this is how it comes out with both the txt and the html conversion using pdf2txt.py:

Documents Required

Copy of Passport (some ports require Passports for all family members listed on the 3299) Form CF-3299 Supplemental Declaration (required by most ports) Detailed inventory in English Copy of Visa (if non-US citizen / permanent resident) / copy of Permanent Resident Card

Copy of Bill of Lading (OBL) / Air Waybill (AWB) Form DS-1504 (Diplomats) A-1 Visa (Diplomats)

Importers Security Filing (ISF) I-94 Stamp / Card

Lines beginning with the letter "I" are consistently pulled out of their place in the text and put on the next blank line, or sometimes on a previous blank line. It also happens to lines beginning with a few other characters.
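
One possible workaround, sketched here under assumptions rather than taken from pdfminer itself: if each extracted line still carries a correct bounding box, the lines can be re-sorted by position after extraction. The `Line` tuple, the `reading_order` helper, the `y_tolerance` parameter, and the coordinates in the usage example are all hypothetical. The sketch assumes PDF-style coordinates where y grows upward, and it only handles single-column text.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """A text line and its bounding box (hypothetical stand-in for a layout object)."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def reading_order(lines: List[Line], y_tolerance: float = 2.0) -> List[Line]:
    """Return lines sorted top-to-bottom, then left-to-right within a row.

    A line joins the current row when its top edge is within y_tolerance
    of the top edge of the first line in that row.
    """
    # y grows upward in this sketch, so a larger y1 is higher on the page.
    by_top = sorted(lines, key=lambda ln: -ln.y1)

    rows: List[List[Line]] = []
    for ln in by_top:
        if rows and abs(rows[-1][0].y1 - ln.y1) <= y_tolerance:
            rows[-1].append(ln)
        else:
            rows.append([ln])

    ordered: List[Line] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda ln: ln.x0))
    return ordered


if __name__ == "__main__":
    # Extraction order from the report above; the coordinates are made up.
    extracted = [
        Line("Copy of Bill of Lading (OBL) / Air Waybill (AWB)", 72, 472, 400, 484),
        Line("Importers Security Filing (ISF)", 72, 430, 300, 442),
        Line("I-94 Stamp / Card", 72, 486, 200, 498),
    ]
    for ln in reading_order(extracted):
        print(ln.text)
```

This does not explain why the lines are displaced in the first place. It only helps when the coordinates are right and the ordering is wrong, and it would scramble multi-column pages.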

It seems like this might be related to https://github.com/euske/pdfminer/issues/82

jrussell999 commented 8 years ago

I found the original site of my example pdf. It's the American Samoa one on this page: https://www.iamovers.org/ResourcesPublications/ShipperGuides.aspx?navItemNumber=580

I didn't realize that file hosting site asked people for their emails.

ghost commented 7 years ago

I have the same issue. Attached is a sample page.

The related excerpt is:

Uhr verließ, war er immer noch so voller Sorge, dass er beim ersten Schritt Jemandem zusammenprallte.

gleich mit

nach

draußen