euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License

Incorrect bounding boxes in xml output using forced analysis #17

Open jervispinto opened 12 years ago

jervispinto commented 12 years ago

Using the command: python pdf2txt.py -t xml -A

produces a verifiable error in bounding boxes.

(Please email me for the pdf)

danshultz commented 12 years ago

Hi @jervispinto -

What do you mean by a verifiable error in the bounding boxes? I'm going to be looking at the bounding-box code soon, as I noticed that the Y coordinates are measured from the bottom of the document and not the top. The X coordinates are correct.

jervispinto commented 12 years ago

I'm attempting to recreate the error, as it's been a while since I looked at this. I checked the XML again, and the missing text seems to be there with (as far as I can tell) correct bounding boxes, but the text disappears during parsing. This may be an issue in my parsing logic, so I'll double-check over the weekend.

The Y coordinates are certainly decreasing with gravity, i.e. they get smaller further down the page.
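
For anyone who needs top-origin coordinates in the meantime, here is a minimal sketch of flipping a bottom-origin bounding box. It assumes a box is given as a comma-separated `x0,y0,x1,y1` string and that the page height is known (for example, from the page's own bounding box). The helper names `parse_bbox` and `flip_bbox` are hypothetical and not part of pdfminer.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def parse_bbox(text: str) -> BBox:
    """Parse a string assumed to look like 'x0,y0,x1,y1' into four floats."""
    x0, y0, x1, y1 = (float(part) for part in text.split(","))
    return (x0, y0, x1, y1)


def flip_bbox(bbox: BBox, page_height: float) -> BBox:
    """Convert a bottom-origin box (Y grows upward) to a top-origin box
    (Y grows downward). X values are unchanged."""
    x0, y0, x1, y1 = bbox
    # The old top edge (y1) becomes the new, smaller Y value.
    return (x0, page_height - y1, x1, page_height - y0)


if __name__ == "__main__":
    # Hypothetical values: a box near the top of a 792 pt tall page.
    box = parse_bbox("100.000,700.000,200.000,720.000")
    print(flip_bbox(box, 792.0))
```

Only the Y values change, which matches the observation above that the X coordinates are already correct.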