Open DGollings opened 10 months ago
I have come across the same problem. I don't know why it is happening, but commenting out the if block at https://github.com/ad-freiburg/pdftotext-plus-plus/blob/cca94e9f3e80c5df91847394d353f9af7808fb3a/src/PdfParsing.cpp#L535-L543, so that it just adds the character to the page, fixes it. I understand that this is not the correct solution.
It appears that pdftotext++ is recognizing individual characters within PDFs but is not extracting the complete text as expected, especially for PDFs produced by wkhtmltopdf in my case.
Did not work: https://argos-support.co.uk/instruction-manual/2002721-lg-43-inch-43uq75006lf-smart-4k-uhd-hdr-led-freeview-tv.pdf
$ pdfinfo 2002721_D001.pdf
Title:
Creator: wkhtmltopdf 0.12.6
Producer: Qt 4.8.7
CreationDate: Wed Sep 21 00:47:49 2022 BST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 150
Encrypted: no
Page size: 595 x 842 pts (A4)
Page rot: 0
File size: 3066222 bytes
Optimized: no
PDF version: 1.4
Worked fine: https://argos-support.co.uk/instruction-manual/4840754-bush-32-inch-hd-eled-tv-hd-ready.pdf
$ pdfinfo 4840754.pdf
Creator: Adobe InDesign CC 14.0 (Windows)
Producer: Adobe PDF Library 15.0
CreationDate: Mon May 25 11:56:23 2020 BST
ModDate: Mon May 25 11:56:34 2020 BST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 50
Encrypted: no
Page size: 595.276 x 841.89 pts (A4)
Page rot: 0
File size: 6979908 bytes
Optimized: yes
PDF version: 1.4
During further investigation, I identified that pdftotext++ incorrectly appends characters to figures instead of the main page content when processing PDFs generated by wkhtmltopdf. This misclassification stems from discrepancies between the character's clipbox and the page's clipbox.
As I understand it, a clipbox defines a rectangular area within a PDF where content (text, images, etc.) is allowed to appear. It essentially acts as a boundary for rendering content.
Observed Behaviour:
Working PDF:
Page clipbox: leftX: 0; upperY: 0; rightX: 595.276; lowerY: 841.89
└─ clipbox: leftX: 0; upperY: 0; rightX: 595.276; lowerY: 841.89
Append to page 7.
Failing PDF (wkhtmltopdf):
Page clipbox: leftX: 0; upperY: 0; rightX: 595; lowerY: 842
└─ clipbox: leftX: 43.5; upperY: 88.1366; rightX: 551.25; lowerY: 798.549
Append to figure figure-3XvIjyWO.
pdftotext++ performs an equality check (within a small tolerance) between the character's clipbox and the page's clipbox. PDFs produced by wkhtmltopdf often give content areas a clipbox smaller than the full page, so the equality check fails and the character is appended to a figure instead of the page.
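To make the failure concrete, here is a minimal C++ sketch of such a tolerance-based comparison. It is not the code from PdfParsing.cpp: the `ClipBox` struct, the function names, and the tolerance of 0.1 are assumptions made purely for illustration. Only the coordinates in the usage note come from the debug output above.

```cpp
#include <cmath>

// Hypothetical rectangle type. The field names mirror the debug output above,
// not necessarily the types used inside pdftotext++.
struct ClipBox {
  double leftX;
  double upperY;
  double rightX;
  double lowerY;
};

// Assumed tolerance in points; the value actually used by pdftotext++ may differ.
constexpr double kTolerance = 0.1;

// True if two coordinates differ by no more than the tolerance.
bool approxEqual(double a, double b, double tolerance = kTolerance) {
  return std::fabs(a - b) <= tolerance;
}

// True if all four edges of the two boxes coincide within the tolerance.
// Under the behaviour described above, a character is appended to the page
// only when this holds for its clipbox and the page's clipbox.
bool clipBoxesMatch(const ClipBox& page, const ClipBox& character) {
  return approxEqual(page.leftX, character.leftX) &&
         approxEqual(page.upperY, character.upperY) &&
         approxEqual(page.rightX, character.rightX) &&
         approxEqual(page.lowerY, character.lowerY);
}
```

With the values logged for the working PDF (page and character clipbox both 0, 0, 595.276, 841.89), `clipBoxesMatch` returns true. With the wkhtmltopdf values (page 0, 0, 595, 842 against character 43.5, 88.1366, 551.25, 798.549), the left edges alone differ by 43.5 pt, so no small tolerance would make the check pass.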
Hello,
We've been using a variant of pdftotext++ with an LLM to parse invoices, and can confirm it does a much better job of 'grouping' text together for better processing.
Unfortunately, we have had a strange issue with v0.0.3 where it 'drops' certain words for no reason we have been able to figure out, whilst a (much older) version works without issue.
This docker image works:
adfreiburg/pdftotext
But the v0.0.3 binary does not.
I was going to redact the personally identifiable information and attach the PDF here, but changing that fixes the issue, so it's likely a layout issue. Would you be interested in having a look? If so, I can e-mail the original file.