ad-freiburg / pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.
https://pdftotext.cs.uni-freiburg.de
Apache License 2.0
15 stars 0 forks

pdftotext ignores words for no discernable reason #30

Open DGollings opened 10 months ago

DGollings commented 10 months ago

Hello,

We've been using a variant of pdftotext++ with an LLM to parse invoices, and can confirm it does a much better job of 'grouping' text together for better processing.

Unfortunately, we have had a strange issue with v0.0.3 where it 'drops' certain words for no reason we have been able to figure out, whilst a (much older) version works without issue.

This docker image works: adfreiburg/pdftotext

But the v0.0.3 binary does not.

I was going to redact the personally identifiable information and attach the PDF here, but changing the file makes the issue go away, so it's likely a layout issue. Would you be interested in having a look? If so, I can e-mail the original file.

kwakwaversal commented 3 days ago

I have come across the same problem. I don't know why it is happening, but commenting out the if block at https://github.com/ad-freiburg/pdftotext-plus-plus/blob/cca94e9f3e80c5df91847394d353f9af7808fb3a/src/PdfParsing.cpp#L535-L543 so that it just adds the character to the page fixes it. I understand that this is not the correct solution.

It appears that pdftotext++ is recognizing individual characters within PDFs but is not successfully extracting the complete text as expected, especially for PDFs produced by wkhtmltopdf in my case.

Did not work: https://argos-support.co.uk/instruction-manual/2002721-lg-43-inch-43uq75006lf-smart-4k-uhd-hdr-led-freeview-tv.pdf

$ pdfinfo 2002721_D001.pdf
Title:
Creator:        wkhtmltopdf 0.12.6
Producer:       Qt 4.8.7
CreationDate:   Wed Sep 21 00:47:49 2022 BST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          150
Encrypted:      no
Page size:      595 x 842 pts (A4)
Page rot:       0
File size:      3066222 bytes
Optimized:      no
PDF version:    1.4

Worked fine: https://argos-support.co.uk/instruction-manual/4840754-bush-32-inch-hd-eled-tv-hd-ready.pdf

$ pdfinfo 4840754.pdf
Creator:        Adobe InDesign CC 14.0 (Windows)
Producer:       Adobe PDF Library 15.0
CreationDate:   Mon May 25 11:56:23 2020 BST
ModDate:        Mon May 25 11:56:34 2020 BST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          50
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      6979908 bytes
Optimized:      yes
PDF version:    1.4

kwakwaversal commented 2 days ago

During further investigation, I identified that pdftotext++ incorrectly appends characters to figures instead of the main page content when processing PDFs generated by wkhtmltopdf. This misclassification stems from discrepancies between the character's clipbox and the page's clipbox.

Detailed Explanation

As I understand it, a clipbox defines a rectangular area within a PDF where content (text, images, etc.) is allowed to appear. It essentially acts as a boundary for rendering content.
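
To make the idea concrete, here is a minimal sketch of how a clipbox comparison could send a character to a figure instead of the page. It is not the actual pdftotext++ code: the names Rect, roughlyEqual and classifyChar and the tolerance parameter are hypothetical, and it assumes the decision is made purely by comparing the clipbox active for the character against the page's clipbox.

```cpp
#include <cmath>
#include <string>

// Hypothetical rectangle type for illustration only; not pdftotext++'s own.
struct Rect {
  double minX, minY, maxX, maxY;
};

// Returns true if the two rectangles coincide within `tol` points on every edge.
bool roughlyEqual(const Rect& a, const Rect& b, double tol) {
  return std::fabs(a.minX - b.minX) <= tol && std::fabs(a.minY - b.minY) <= tol &&
         std::fabs(a.maxX - b.maxX) <= tol && std::fabs(a.maxY - b.maxY) <= tol;
}

// Decides where a character ends up, given the clipbox active when it was drawn.
// Assumption: a character whose clipbox matches the page's clipbox is page text;
// any other character is attributed to a figure.
std::string classifyChar(const Rect& charClipBox, const Rect& pageClipBox, double tol) {
  return roughlyEqual(charClipBox, pageClipBox, tol) ? "page" : "figure";
}
```

Under this assumption, with a page clipbox of 595 x 842 pts, any character drawn while a smaller clipbox is active would be classified as figure content and would therefore be missing from the page text.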

Observed Behaviour:

Root Cause Analysis