conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License

Any way to convert a whole PDF with tables directly to HTML? #33

Open xdave opened 2 weeks ago

xdave commented 2 weeks ago

The table detection and table formatting are working wonderfully for targeting markdown; however, the PyPDFium2 formatting of non-tables is quite lacking. Unfortunately, I need to move away from pymupdf for two reasons:

  1. the license issue, and
  2. there are many PDFs that I come across randomly that take astronomically large amounts of time to process with pymupdf, and I want to improve the processing time on CPU.

I've been reading the docs and testing things out, and I can't seem to find a way to generate HTML in the same way that I can with embed_tables() for markdown. And if there is a way, I would need it to preserve more of the original text formatting.

Any pointers? Thanks.

conjuncts commented 2 weeks ago

To clarify, are you looking to extract text formatting like headers/bold/italic/superscript/subscript?

xdave commented 2 weeks ago

> To clarify, are you looking to extract text formatting like headers/bold/italic/superscript/subscript?

Not necessarily, more like paragraphs, lists, tables. Structure-related.

xdave commented 2 weeks ago

To be honest, headers, bold, italic would be useful, too... I haven't needed superscript or subscript for anything. Full disclosure, I was trying to see if there was an already-built-in way to do this because I'm trying to find a better approach to getting it into better markdown by using py-markdownify or something.

conjuncts commented 2 weeks ago

I see. My library does not detect those sorts of structure, unfortunately.

If you can spare the CPU cycles, try marker.
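
For the tables alone, HTML output can be produced outside the library once the cell text is in hand. The snippet below is only a minimal sketch under stated assumptions: it assumes each extracted table is already available as a header plus rows of cell values, and `rows_to_html` and the sample data are hypothetical names made up for illustration, not part of gmft. It does nothing for paragraphs, lists, or headers.

```python
from html import escape


def rows_to_html(header, rows):
    """Render a header and body rows as a minimal HTML <table> string.

    Hypothetical helper for illustration; not a gmft function.
    """
    # Escape every cell so characters like < and & cannot break the markup.
    head = "".join(f"<th>{escape(str(cell))}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Made-up sample data standing in for one extracted table.
html_table = rows_to_html(["Name", "Qty"], [["Widget <A>", 3], ["Gadget", 12]])
print(html_table)
```

The resulting string can be spliced into a larger HTML document at the position where the table was found, which keeps the table step independent of whichever tool ends up handling the surrounding text.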