Closed aagubanov closed 1 year ago
page.Text is raw content stream output. This is to do with how PDFs internally represent text. For better extraction, consider using a combination of layout analysis tools to construct the information you need: https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis
As stated on the page https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Analysis, "In this example, the decoration blocks (header and page number) are not ordered correctly". The document attached to this ticket has a similar structure, so layout analysis tools would not help in this case.
Note that the Decoration Text Block Classifier requires at least 2 pages (which the attached document does not have) and, more importantly, requires the whole collection of document pages as a list, i.e. it is incompatible with lazy page-by-page analysis.
TextExtraction.pdf
For the attached sample, page text was extracted with the property Page.Text. The following shortcomings are noted:
1) Text order differs from the visible one. For example, the page number is returned at the very beginning, while in the visual rendering it sits at the very bottom of the page. It would be helpful to have the text as it is visible rather than as it is stored inside the PDF structure.
2) Text is not split into paragraphs. Instead, text blocks are merely separated by a couple of spaces. It would be helpful to have linefeeds between paragraphs (or at least a special separator that could be handled programmatically).
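To make the two requests concrete, here is a minimal, library-agnostic sketch of the desired behaviour. It is not PdfPig's API: the Block type, the visual_order_text helper and the coordinates are all invented for illustration, and it assumes text blocks with bounding boxes have already been obtained by some other means. It relies only on the fact that the default PDF coordinate system has its origin at the bottom-left, so a larger y value is higher on the page.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: its text plus the top-left corner in PDF points."""
    text: str
    left: float
    top: float  # PDF y axis grows upward, so a larger top is higher on the page


def visual_order_text(blocks, separator="\n\n"):
    """Join blocks top-to-bottom, then left-to-right, one paragraph per block."""
    ordered = sorted(blocks, key=lambda b: (-b.top, b.left))
    return separator.join(b.text.strip() for b in ordered)


# Content-stream order: the page number comes first, as in the reported sample.
# The coordinates below are made up for the example.
blocks = [
    Block("1", left=290.0, top=40.0),
    Block("First paragraph.", left=72.0, top=720.0),
    Block("Second paragraph.", left=72.0, top=650.0),
]
print(visual_order_text(blocks))
```

With this input the page number ends up last and the paragraphs are separated by a blank line. A plain top-to-bottom sort like this only suits single-column pages; multi-column layouts would need a real reading-order step.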