UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)
https://github.com/UglyToad/PdfPig/wiki
Apache License 2.0
1.73k stars, 241 forks

Page text differs from visual rendering #495

Closed: aagubanov closed this issue 1 year ago

aagubanov commented 2 years ago

TextExtraction.pdf

For the attached sample, the page text was extracted with the Page.Text property. The following shortcomings were noted:

1) The text order differs from the visible order. For example, the page number is returned at the very beginning, while in the visual rendering it sits at the very bottom of the page. It would be helpful to get the text as it is visible rather than as it is stored inside the PDF structure.

2) The text is not split into paragraphs. Instead, text blocks are separated only by a couple of spaces. It would be helpful to have linefeeds between paragraphs (or at least a special separator that could be handled programmatically).
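
To make the two requests concrete, here is a minimal, library-free sketch of the desired behaviour. It is not PdfPig code: the `Block` type and the `visual_text` function are hypothetical names used only for illustration, and the sketch assumes a single-column page whose text blocks already carry positions in PDF user space, where y grows upward from the bottom of the page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: its text plus the position of its top-left corner."""
    text: str
    left: float
    top: float  # PDF user space: larger y means higher on the page


def visual_text(blocks, separator="\n\n"):
    """Join blocks top-to-bottom, then left-to-right, with a paragraph separator."""
    # Negate `top` so that blocks higher on the page come first.
    ordered = sorted(blocks, key=lambda b: (-b.top, b.left))
    return separator.join(b.text for b in ordered)


# Blocks listed in content-stream order: the page number comes first in the
# stream although it is drawn at the bottom of the page.
blocks = [
    Block("3", 300.0, 40.0),
    Block("First paragraph.", 72.0, 700.0),
    Block("Second paragraph.", 72.0, 640.0),
]

print(visual_text(blocks))
```

Sorting by position is only adequate for simple single-column layouts; multi-column pages need a real reading-order analysis.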

EliotJones commented 1 year ago

page.Text is the raw content stream output, so its order reflects how the PDF internally represents text rather than how the page looks. For better extraction, consider using a combination of layout analysis tools to construct the information you need: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis

aagubanov commented 1 year ago

As stated on the page https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis, "In this example, the decoration blocks (header and page number) are not ordered correctly". The document attached to this ticket has a similar structure, so the layout analysis tools would not help in this case.

Note that the Decoration Text Block Classifier requires at least 2 pages (the attached document has only one) and, more importantly, requires the whole collection of document pages as a list, i.e. it is incompatible with lazy page-by-page analysis.
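
A rough sketch of why cross-page decoration detection needs several pages at once follows. This is not PdfPig's algorithm; `repeated_lines` is a hypothetical helper that assumes each page has been reduced to `(text, rounded_y)` tuples and treats anything repeated on at least `min_pages` pages as decoration.

```python
from collections import Counter


def repeated_lines(pages, min_pages=2):
    """Return the (text, y) items that occur on at least `min_pages` pages.

    `pages` is a list of pages; each page is a list of (text, rounded_y) tuples.
    """
    if len(pages) < min_pages:
        # With fewer pages than the threshold there is nothing to compare against.
        return set()
    counts = Counter()
    for page in pages:
        # Count each distinct item once per page.
        counts.update(set(page))
    return {item for item, n in counts.items() if n >= min_pages}


two_pages = [
    [("Report", 800), ("Body A", 600)],
    [("Report", 800), ("Body B", 600)],
]
one_page = [[("Report", 800), ("Body A", 600)]]

print(repeated_lines(two_pages))
print(repeated_lines(one_page))
```

Because every page has to be counted before any item can be classified, an approach like this cannot emit results for page 1 before later pages have been read, which matches the limitation described above.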