Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License
2.34k stars 89 forks

PyMuPdf Hierarchal Headings #35

Closed mingzhang798 closed 3 months ago

mingzhang798 commented 4 months ago

Description

Could you integrate PyMuPDF's pdf4llm.to_markdown() so that the parsed PDF output is more hierarchical? For example, ("##", "Header 1") would represent a first-level heading, ("###", "Header 2") a second-level heading, ("####", "Header 3") a third-level heading, and so on. The output could then be split by heading with LangChain's MarkdownHeaderTextSplitter(). Link: https://python.langchain.com/docs/modules/data_connection/document_transformers/markdown_header_metadata/
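
To make the request concrete, here is a minimal, stdlib-only sketch of what header-based splitting could produce once the parsed output carries heading levels. It is not LangChain's MarkdownHeaderTextSplitter and not actual pdf4llm output: `split_by_headers`, `HEADERS_TO_SPLIT_ON`, and the sample Markdown are hypothetical names and data made up for illustration, and the sketch ignores fenced code blocks and heading levels that are not listed.

```python
from typing import Dict, List, Tuple

# Mapping described in the request: Markdown prefix -> metadata key.
# (Hypothetical constant for this sketch.)
HEADERS_TO_SPLIT_ON: List[Tuple[str, str]] = [
    ("##", "Header 1"),
    ("###", "Header 2"),
    ("####", "Header 3"),
]


def split_by_headers(
    markdown: str,
    headers: List[Tuple[str, str]] = HEADERS_TO_SPLIT_ON,
) -> List[Dict]:
    """Group body text under the headings that are active where it appears."""
    active: Dict[int, Tuple[str, str]] = {}  # heading depth -> (metadata key, title)
    chunks: List[Dict] = []
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered body text with the current heading trail as metadata.
        text = "\n".join(buffer).strip()
        if text:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "content": text})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        # Requiring a space after the prefix keeps "###" from matching "##".
        match = next(
            ((p, k) for p, k in headers if stripped.startswith(p + " ")), None
        )
        if match is None:
            buffer.append(line)
            continue
        flush()
        prefix, key = match
        depth = len(prefix)
        # A new heading closes every heading at the same or a deeper level.
        for open_depth in [d for d in active if d >= depth]:
            del active[open_depth]
        active[depth] = (key, stripped[len(prefix):].strip())

    flush()
    return chunks


# Made-up sample input, standing in for hierarchical parser output.
sample = """## Introduction
Overview text.
### Installation
Setup text.
## Usage
Usage text.
"""
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk keeps the trail of headings above it as metadata, which is the property that makes hierarchical headings useful for downstream chunking and retrieval.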

Filimoa commented 4 months ago

Could you provide some examples of before and after?

Filimoa commented 3 months ago

Closing due to inactivity