ChrizH / pdfstructure

`pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author.

Wrong text order #3

Closed bogdankostic closed 3 years ago

bogdankostic commented 3 years ago

First of all, thanks for this project! On some of my PDF documents, the exact hierarchical document structure was extracted.

Unfortunately, the result was not that good for some documents, especially for multi-column docs. For example, when I want to get the structure of this doc, the text order is quite messed up. Do you plan to fix this to support different types of documents?

ChrizH commented 3 years ago

Hi @bogdankostic, thank you for your feedback! I plan to detect a document's layout automatically and optimise the LAParams accordingly. For now, I have pushed a commit that improves support for column-based documents by default (at the cost of some performance):

source = FileSource(file_path, la_params=LAParams(boxes_flow=0.3, detect_vertical=True))
parser.parse_pdf(source)

If no advanced column-based analysis is needed, one can simply disable it with:

FileSource(file_path, la_params=LAParams(boxes_flow=None, detect_vertical=False))
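The two configurations above can be wrapped in a small switch. This is only a sketch: `layout_kwargs` and `build_source` are hypothetical helper names, not part of pdfstructure, and the import paths `pdfminer.layout.LAParams` and `pdfstructure.source.FileSource` are assumed rather than confirmed in this thread.

```python
from typing import Any, Dict


def layout_kwargs(column_aware: bool) -> Dict[str, Any]:
    """Return keyword arguments for LAParams (hypothetical helper)."""
    if column_aware:
        # Values quoted in this thread: better for multi-column PDFs, but slower.
        return {"boxes_flow": 0.3, "detect_vertical": True}
    # Values quoted in this thread for disabling the column-based analysis.
    return {"boxes_flow": None, "detect_vertical": False}


def build_source(file_path: str, column_aware: bool = True):
    """Create a FileSource with the chosen layout settings (hypothetical helper).

    The imports are done lazily so that layout_kwargs can be used and tested
    without pdfminer.six or pdfstructure installed. Both import paths are
    assumptions.
    """
    from pdfminer.layout import LAParams
    from pdfstructure.source import FileSource

    return FileSource(file_path, la_params=LAParams(**layout_kwargs(column_aware)))
```

With such a helper, `parser.parse_pdf(build_source(file_path, column_aware=False))` would fall back to the faster path for single-column documents.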

Output for your example: Dense_Passage_Retrieval.txt