camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.76k stars, 446 forks

How to combine tabular and non-tabular content from a PDF? #498

Open tpanza opened 1 month ago

tpanza commented 1 month ago

Thanks for a great tool. I haven't seen this addressed anywhere, so I'll ask it here.

I have some large PDFs that consist of tables and some "regular" text. What I'd like to do is convert each PDF to a single HTML (or Markdown) file, using a simple text extract for the non-tabular parts and Camelot for the tabular parts, while keeping the overall order of the document intact.

Basically, keep all of the content in order, but with the tabular data appropriately formatted in HTML/Markdown. For my situation, I want to keep the surrounding context before and after the tables.

Is there a way to do this? If not, could someone point me to a good place in the Camelot code to insert such a patch?
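
For illustration, the merge step described above could be sketched roughly as follows. This is not Camelot's API: `TextBlock`, `TableBlock`, `table_to_markdown`, and `merge` are hypothetical names, and the sketch assumes that the page number and bounding box of each table, plus positioned text blocks from a separate text extractor, have already been obtained somehow. It also assumes PDF-style coordinates (origin at the bottom left, so a larger `y` is higher on the page) and filters text by vertical position only, which would wrongly drop text sitting beside a table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBlock:
    """Hypothetical container for a chunk of non-tabular text."""
    page: int
    top: float  # y of the block's top edge (larger = higher on the page)
    text: str


@dataclass
class TableBlock:
    """Hypothetical container for one extracted table."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2); y2 is the top edge
    rows: List[List[str]]  # first row is treated as the header


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows as a pipe-delimited Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def _inside(block: TextBlock, table: TableBlock) -> bool:
    """True if the text block starts within the table's vertical extent."""
    _, y1, _, y2 = table.bbox
    return block.page == table.page and y1 <= block.top <= y2


def merge(text_blocks: List[TextBlock], tables: List[TableBlock]) -> str:
    """Interleave text and tables in reading order (page, then top to bottom)."""
    # Drop text that overlaps a table so cell contents are not emitted twice.
    kept = [t for t in text_blocks if not any(_inside(t, tb) for tb in tables)]
    # Negate y so that sorting ascending walks each page from top to bottom.
    items = [(t.page, -t.top, t.text) for t in kept]
    items += [(tb.page, -tb.bbox[3], table_to_markdown(tb.rows)) for tb in tables]
    items.sort(key=lambda item: (item[0], item[1]))
    return "\n\n".join(item[2] for item in items)
```

With such a function, the text before and after each table stays in place and only the table region is replaced by formatted Markdown; an HTML variant would only need a different renderer in place of `table_to_markdown`.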