Closed rajasekarshanmugam closed 1 year ago
@rajasekarshanmugam, will have a look tomorrow.
@rajasekarshanmugam can you check if the following does the job for you? https://github.com/UglyToad/PdfPig/blob/c74ca5fda8a51f3af9cf19c675272f57d3beee60/src/UglyToad.PdfPig.DocumentLayoutAnalysis/DuplicateOverlappingTextProcessor.cs
It's supposed to handle this kind of case.
I guess it is the same issue as https://github.com/UglyToad/PdfPig/issues/471, which was fixed by DuplicateOverlappingTextProcessor.cs.
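For readers unfamiliar with the idea, here is a minimal, language-neutral sketch in Python of what removing duplicate overlapping letters can look like. It is not PdfPig's implementation: the `Letter` type, the `drop_overlapping_duplicates` helper, and the `tolerance_ratio` value are all hypothetical names chosen for illustration. It only assumes that a duplicate is a letter with the same value drawn at almost the same position as one already kept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Letter:
    """Hypothetical stand-in for a parsed glyph: its text and position."""
    value: str
    x: float
    y: float
    height: float


def drop_overlapping_duplicates(letters, tolerance_ratio=0.2):
    """Keep the first of any letters that share a value and nearly the same position.

    tolerance_ratio is an assumed threshold, expressed as a fraction of the
    letter height; it is not taken from any library.
    """
    kept = []
    for letter in letters:
        tolerance = letter.height * tolerance_ratio
        is_duplicate = any(
            k.value == letter.value
            and abs(k.x - letter.x) <= tolerance
            and abs(k.y - letter.y) <= tolerance
            for k in kept
        )
        if not is_duplicate:
            kept.append(letter)
    return kept


# Every glyph of "PUBLISHER" drawn twice, the second copy shifted by 0.3 units.
doubled = [
    Letter(ch, 100.0 + i * 6.0 + dx, 700.0, 10.0)
    for i, ch in enumerate("PUBLISHER")
    for dx in (0.0, 0.3)
]
cleaned = "".join(l.value for l in drop_overlapping_duplicates(doubled))

# A genuine double letter ("LL" in "HELLO") sits at distinct positions and is kept.
hello = [Letter(ch, i * 6.0, 700.0, 10.0) for i, ch in enumerate("HELLO")]
hello_text = "".join(l.value for l in drop_overlapping_duplicates(hello))
```

The position check is what separates a real repeated letter from a glyph that was drawn twice; comparing values alone would also collapse words such as "HELLO".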
Got this working: I worked around the problem with a wrapper API. Thank you so much, @BobLd.
Actual Behavior: We are processing the attached document. When we look at the parsed content, we see duplicate strings being read. The matched lines/segments contain a number of duplicate characters. However, in the Adobe, PDF-XChange, and Chrome readers we don't see the duplicate characters or words. If we copy and paste the text into Notepad, there are no duplicates either.
Expected Behavior: The document is read as per its content, without any duplicates.
Steps to reproduce: Read the attached PDF document using the library and print the results; the duplicates appear. We also added some console messages in the default word extractor, and its output does not match the text that is displayed.
Details: We added some debug statements in the class DefaultWordExtractor, in the method public IEnumerable<Word> GetWords(IReadOnlyList<Letter> letters).
BEFORE - the letters as read from the document. Notice that the word "publisher" is already repeated.
AFTER - the letters after sorting by Y descending and then by X ascending. Notice that "publisher" became "PPUUBBLLIISSHHEERR".
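The BEFORE/AFTER effect can be reproduced with a small Python sketch. It assumes, without confirmation from the attached PDF, that the word is drawn twice at almost the same position; the `Letter` type, the coordinates, and the 0.2 offset are hypothetical values chosen for illustration, not taken from the document or from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Letter:
    """Hypothetical stand-in for a parsed glyph: its text and position."""
    value: str
    x: float
    y: float


def make_word(text, x0, y, step=6.0):
    """Lay out the characters of text left to right, starting at x0."""
    return [Letter(ch, x0 + i * step, y) for i, ch in enumerate(text)]


def sort_reading_order(letters):
    """Sort by Y descending, then by X ascending, as described above."""
    return sorted(letters, key=lambda l: (-l.y, l.x))


# Assumed cause: the same word is emitted twice, the second copy slightly shifted.
first_pass = make_word("PUBLISHER", 100.0, 700.0)
second_pass = make_word("PUBLISHER", 100.2, 700.0)

before = first_pass + second_pass
after = sort_reading_order(before)

before_text = "".join(l.value for l in before)  # PUBLISHERPUBLISHER
after_text = "".join(l.value for l in after)    # PPUUBBLLIISSHHEERR
```

Under this assumption the sort itself is behaving correctly: it interleaves the two copies because each pair of glyphs shares a line and sits at nearly the same X position. That would also be consistent with PDF viewers showing the word only once, since the two copies are drawn on top of each other.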
Below is the document that has this error. Document-PublisherError.pdf