pdftotext ignores words for no discernable reason

DGollings commented 10 months ago

Hello,

We've been using a variant of pdftotext++ with an LLM to parse invoices, and can confirm it does a much better job of 'grouping' text together for better processing.

Unfortunately, we have hit a strange issue with v0.0.3: it 'drops' certain words for no reason we have been able to figure out, whilst a (much older) version works without issue.

This docker image works: adfreiburg/pdftotext

But the v0.0.3 binary does not.

I was going to redact the personally identifiable information and attach the PDF here, but changing that fixes the issue, so it's likely a layout issue. Would you be interested in having a look? If so, I can e-mail the original file.

kwakwaversal commented 3 days ago

I have come across the same problem. I don't know why it is happening, but commenting out the if block at https://github.com/ad-freiburg/pdftotext-plus-plus/blob/cca94e9f3e80c5df91847394d353f9af7808fb3a/src/PdfParsing.cpp#L535-L543 so that it just adds the character to the page fixes it. I understand that this is not the correct solution.

It appears that pdftotext++ is recognizing individual characters within PDFs but is not successfully extracting the complete text as expected, especially for PDFs produced by wkhtmltopdf in my case.

Did not work: https://argos-support.co.uk/instruction-manual/2002721-lg-43-inch-43uq75006lf-smart-4k-uhd-hdr-led-freeview-tv.pdf

$ pdfinfo 2002721_D001.pdf
Title:
Creator:        wkhtmltopdf 0.12.6
Producer:       Qt 4.8.7
CreationDate:   Wed Sep 21 00:47:49 2022 BST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          150
Encrypted:      no
Page size:      595 x 842 pts (A4)
Page rot:       0
File size:      3066222 bytes
Optimized:      no
PDF version:    1.4

Worked fine: https://argos-support.co.uk/instruction-manual/4840754-bush-32-inch-hd-eled-tv-hd-ready.pdf

$ pdfinfo 4840754.pdf
Creator:        Adobe InDesign CC 14.0 (Windows)
Producer:       Adobe PDF Library 15.0
CreationDate:   Mon May 25 11:56:23 2020 BST
ModDate:        Mon May 25 11:56:34 2020 BST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          50
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      6979908 bytes
Optimized:      yes
PDF version:    1.4

kwakwaversal commented 2 days ago

During further investigation, I identified that pdftotext++ incorrectly appends characters to figures instead of the main page content when processing PDFs generated by wkhtmltopdf. This misclassification stems from discrepancies between the character's clipbox and the page's clipbox.

Detailed Explanation

As I understand it, a clipbox defines a rectangular area within a PDF where content (text, images, etc.) is allowed to appear. It essentially acts as a boundary for rendering content.

Observed Behaviour:

Working PDFs (e.g., Adobe InDesign):
- Character Clipbox: Matches the page's clipbox within a defined tolerance (COORDS_EQUAL_TOLERANCE).
- Result: Characters are correctly appended to the page's character list.
```
Page clipbox: leftX: 0; upperY: 0; rightX: 595.276; lowerY: 841.89
└─ clipbox: leftX: 0; upperY: 0; rightX: 595.276; lowerY: 841.89
Append to page 7.
```

Non-Working PDFs (e.g., wkhtmltopdf):
- Character Clipbox: Does not match the page's clipbox.
- Result: Characters are incorrectly appended to a figure instead of the page.

Page clipbox: leftX: 0; upperY: 0; rightX: 595; lowerY: 842
└─ clipbox: leftX: 43.5; upperY: 88.1366; rightX: 551.25; lowerY: 798.549
Append to figure figure-3XvIjyWO.

Root Cause Analysis

Clipbox Equality Check:
- The existing logic in pdftotext++ checks whether the character's clipbox equals the page's clipbox, allowing only a small tolerance (COORDS_EQUAL_TOLERANCE).
- wkhtmltopdf PDFs often use a clipbox for content areas that differs from the full page size, so the equality check fails.

Impact of Mismatch:
- Because of the mismatch, characters that belong to the main page content are misclassified as part of figures, so the extracted text is incomplete or missing.

ad-freiburg / pdftotext-plus-plus

pdftotext ignores words for no discernable reason #30
