UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0

Certain PDFs, when read, result in duplicate strings that do not exist in the document #542

Closed. rajasekarshanmugam closed this issue 1 year ago

rajasekarshanmugam commented 1 year ago

Actual Behavior: We are processing the attached document. When we look at the parsed content, we see duplicate strings being read. In the matched lines/segments there are a number of duplicated characters. However, in the Adobe, PDF-XChange, and Chrome readers we don't see the duplicate characters or words. If we copy and paste the text into Notepad, there are no duplicates either.

Expected Behavior: The document is read as per its content, without any duplicates.

Steps to reproduce: Read the attached PDF document with the library and print the results; the duplicates appear in the output. I also added some console messages in the default word extractor, and what it receives does not match the text that is displayed.

Details: I added some debug statements in the class DefaultWordExtractor, in the method public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters).

BEFORE - shows the letters as they are read from the document. Notice that the word "publisher" is already repeated.

AFTER - shows the letters after sorting by Y descending and then by X ascending. Notice that "publisher" became "PPUUBBLLIISSHHEERR".

image

Below is the document that has this error. Document-PublisherError.pdf
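
The BEFORE/AFTER effect described above can be reproduced without the library. The following is a minimal, self-contained sketch, assuming the duplicated letters sit at the same coordinates as the originals (the report does not state this explicitly). The `Glyph` record, the `SortDemo` class, and the coordinates are made up for illustration and are not PdfPig types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a PDF letter: a character plus its position on the page.
public record Glyph(string Value, double X, double Y);

public static class SortDemo
{
    // Builds one glyph per character, laid out left to right on a single baseline.
    public static List<Glyph> Layout(string text, double y) =>
        text.Select((c, i) => new Glyph(c.ToString(), i * 10.0, y)).ToList();

    // Same ordering as described in the report: Y descending, then X ascending.
    public static string SortedText(IEnumerable<Glyph> glyphs) =>
        string.Concat(glyphs.OrderByDescending(g => g.Y)
                            .ThenBy(g => g.X)
                            .Select(g => g.Value));

    public static void Main()
    {
        // Assumption: the same word is drawn twice at identical positions.
        var before = Layout("PUBLISHER", 700)
            .Concat(Layout("PUBLISHER", 700))
            .ToList();

        // Reading order: the word simply appears twice.
        Console.WriteLine(string.Concat(before.Select(g => g.Value)));

        // Position order: each letter lands next to its duplicate.
        Console.WriteLine(SortedText(before));
    }
}
```

Because the LINQ ordering is stable and both copies share the same X and Y, every letter ends up directly next to its twin, which gives the doubled-letter pattern seen in the AFTER output.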

fnatzke commented 1 year ago

@rajasekarshanmugam, will have a look tomorrow.

BobLd commented 1 year ago

@rajasekarshanmugam can you check if the following does the job for you? https://github.com/UglyToad/PdfPig/blob/c74ca5fda8a51f3af9cf19c675272f57d3beee60/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DuplicateOverlappingTextProcessor.cs

It's supposed to handle this kind of case.

I guess that it is the same issue as https://github.com/UglyToad/PdfPig/issues/471, which was fixed by DuplicateOverlappingTextProcessor.cs

rajasekarshanmugam commented 1 year ago

Got this working: I worked around the problem with a wrapper API. Thank you so much, @BobLd.
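
The wrapper itself is not shown in this thread. As a rough sketch of what such a wrapper might look like: the class name `DeduplicatedText`, its method names, and the file path are hypothetical, and the `DuplicateOverlappingTextProcessor.Get` and `DefaultWordExtractor.Instance` calls are assumed to match the PdfPig version in use, so check them against the linked source before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.Util;

// Hypothetical wrapper: removes duplicate overlapping letters before word extraction.
public static class DeduplicatedText
{
    public static IEnumerable<Word> GetWords(Page page)
    {
        // Assumed API: filters out letters that duplicate and overlap another letter.
        var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);

        // Run the same default word extractor as in the report, on the filtered letters.
        return DefaultWordExtractor.Instance.GetWords(letters.ToList());
    }

    public static void Main(string[] args)
    {
        // Hypothetical path; pass the PDF to inspect as the first argument.
        var path = args.Length > 0 ? args[0] : "Document-PublisherError.pdf";

        using (var document = PdfDocument.Open(path))
        {
            foreach (var page in document.GetPages())
            {
                Console.WriteLine(string.Join(" ", GetWords(page).Select(w => w.Text)));
            }
        }
    }
}
```

Filtering at the letter level, before any word extractor runs, keeps the rest of the extraction code unchanged, which is presumably why a thin wrapper was enough here.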