Open xdave opened 2 weeks ago
To clarify, are you looking to extract text formatting like headers/bold/italic/superscript/subscript?
Not necessarily, more like paragraphs, lists, tables. Structure-related.
To be honest, headers, bold, and italic would be useful, too. I haven't needed superscript or subscript for anything. Full disclosure: I was checking whether there is a built-in way to do this, because I'm looking for a better approach to getting the output into cleaner Markdown, using py-markdownify or something similar.
The table detection and table formatting are working wonderfully for targeting Markdown; however, the PyPDFium2 formatting of non-table content is quite lacking. Unfortunately, I need to move away from pymupdf for two reasons:
I've been reading the docs and testing things out, and I can't find a way to generate HTML in the same way that I can with embed_tables() for Markdown. And if there is a way, I would need it to preserve more of the original text formatting.
Any pointers? Thanks.