Wrong text order - Githubissues

ChrizH / pdfstructure

`pdfstructure` detects, splits and organizes the document's text content into its natural structure as envisioned by the author.

Hi @bogdankostic, thank you for your feedback! I plan to detect a document's layout automatically and optimise the LAParams accordingly. For now, I have pushed a commit that improves support for column-based documents by default (though it is less performant):

source = FileSource(file_path, la_params=LAParams(boxes_flow=0.3, detect_vertical=True))
parser.parse_pdf(source)

If no advanced column-based analysis is needed, it can simply be disabled with:

FileSource(file_path, la_params=LAParams(boxes_flow=None, detect_vertical=False))
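For reference, the two configurations above can be wrapped in a small helper. This is only a sketch: `layout_kwargs` and `parse_with_layout` are hypothetical names, and the `pdfstructure` import paths are assumed from the project README rather than confirmed in this thread. The parameter values are the ones quoted above; in pdfminer, `boxes_flow=None` switches off the advanced layout analysis.

```python
from typing import Any, Dict


def layout_kwargs(column_aware: bool = True) -> Dict[str, Any]:
    """Return the LAParams keyword arguments quoted in this thread."""
    if column_aware:
        # Column-friendly settings from the comment above (slower).
        return {"boxes_flow": 0.3, "detect_vertical": True}
    # boxes_flow=None disables pdfminer's advanced layout analysis.
    return {"boxes_flow": None, "detect_vertical": False}


def parse_with_layout(file_path: str, column_aware: bool = True):
    """Parse a PDF with pdfstructure using the chosen layout settings."""
    # Imported lazily so layout_kwargs works without the libraries installed.
    from pdfminer.layout import LAParams
    # Assumed import paths (taken from the pdfstructure README).
    from pdfstructure.hierarchy.parser import HierarchyParser
    from pdfstructure.source import FileSource

    la_params = LAParams(**layout_kwargs(column_aware))
    source = FileSource(file_path, la_params=la_params)
    return HierarchyParser().parse_pdf(source)
```

Calling `parse_with_layout("paper.pdf", column_aware=False)` (with a placeholder file name) would then correspond to the second snippet above.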

Output for your example: Dense_Passage_Retrieval.txt

Wrong text order #3