Closed bogdankostic closed 3 years ago
Hi @bogdankostic, thank you for your feedback!
I plan to detect a document's layout automatically and optimise the LAParams accordingly.
For now, I have pushed a commit that improves support for column-based documents by default (at the cost of some performance):
source = FileSource(file_path, la_params=LAParams(boxes_flow=0.3, detect_vertical=True))
parser.parse_pdf(source)
If no advanced column-based analysis is needed, it can simply be disabled with:
FileSource(file_path, la_params=LAParams(boxes_flow=None, detect_vertical=False))
Output for your example: Dense_Passage_Retrieval.txt
First of all, thanks for this project! For some of my PDF documents, the exact hierarchical document structure was extracted.
Unfortunately, the results were not as good for other documents, especially multi-column ones. For example, when I try to get the structure of this doc, the text order is quite messed up. Do you plan to fix this to support different types of documents?
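For readers wondering why text order can get scrambled in multi-column documents, here is a minimal toy sketch. It is not the parser's or pdfminer's actual layout algorithm; `TextBox`, `naive_order`, and `column_order` are hypothetical names made up for illustration, and the fixed equal-width column split is a simplifying assumption. It only shows how ordering text boxes purely by position interleaves the columns, while grouping them by column first keeps each column's text together.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Hypothetical stand-in for a text box found on a page."""
    x0: float   # left edge of the box
    top: float  # distance from the top of the page
    text: str


def naive_order(boxes: List[TextBox]) -> List[str]:
    """Read strictly top-to-bottom, then left-to-right (interleaves columns)."""
    return [b.text for b in sorted(boxes, key=lambda b: (b.top, b.x0))]


def column_order(boxes: List[TextBox], page_width: float, n_columns: int = 2) -> List[str]:
    """Assign each box to an equal-width column by its left edge, then read column by column."""
    col_width = page_width / n_columns

    def key(b: TextBox):
        # Clamp so a box starting exactly at the right page edge stays in the last column.
        col = min(int(b.x0 // col_width), n_columns - 1)
        return (col, b.top, b.x0)

    return [b.text for b in sorted(boxes, key=key)]


# Two columns, two lines each, on a 600-unit-wide page (made-up coordinates).
boxes = [
    TextBox(50, 100, "L1"),
    TextBox(320, 100, "R1"),
    TextBox(50, 130, "L2"),
    TextBox(320, 130, "R2"),
]

print(naive_order(boxes))                   # ['L1', 'R1', 'L2', 'R2']
print(column_order(boxes, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

Real layout analysis has to infer the columns rather than assume them, which is presumably why the column-aware setting above is slower.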