Is there a way to parse the whole pdf and the tables alone with gmft

sahilarora3117 commented 1 month ago

Hi, this is an amazing project. I want to integrate it into a RAG pipeline, using gmft to parse the tables while also parsing the normal content in the PDF. Can you share an example of how to do this? Thanks.

conjuncts commented 1 month ago

Hi, great question. There are a few options:

# `table` is a gmft.CroppedTable detected earlier
print(' '.join(text for _, _, _, _, text in table.text_positions(outside=True)))

But that won't work if there are multiple tables. That also loses pymupdf's newline placement.
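For the multiple-table case, one possible workaround is to filter word boxes against every table bbox on the page. The sketch below is only an illustration: `words_outside_tables` is a hypothetical helper, not part of gmft, and it assumes you already have the page's words as `(x0, y0, x1, y1, text)` tuples (the same shape as in the snippet above) and the table bboxes in the same coordinate system.

```python
from typing import Iterable, List, Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)
BBox = Tuple[float, float, float, float]       # (x0, y0, x1, y1)


def words_outside_tables(words: Iterable[Word], table_bboxes: List[BBox]) -> List[str]:
    """Hypothetical helper: keep words whose center lies outside every table bbox."""
    kept = []
    for x0, y0, x1, y1, text in words:
        # Use the word's center so words straddling a table edge are decided once.
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        inside = any(
            bx0 <= cx <= bx1 and by0 <= cy <= by1
            for bx0, by0, bx1, by1 in table_bboxes
        )
        if not inside:
            kept.append(text)
    return kept


# Made-up page with two tables and four words.
words = [
    (10, 10, 40, 20, "Intro"),
    (10, 110, 40, 120, "cell_a"),   # inside the first table
    (10, 210, 40, 220, "Between"),
    (10, 310, 40, 320, "cell_b"),   # inside the second table
]
tables = [(0, 100, 200, 150), (0, 300, 200, 350)]
print(" ".join(words_outside_tables(words, tables)))
```

This still joins words with spaces, so it does not recover newline placement either.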

For this use case, I've actually been using native pymupdf:

# Setup
import pymupdf

doc = pymupdf.open('notebooks/samples/stats.pdf')  # type: pymupdf.Document
# `table` is a gmft.CroppedTable detected earlier

# Code
info = table.to_dict()
page_no = info['page_no']  # same as table.page.page_number
page = doc[page_no]
rect = info['bbox']  # same as table.bbox
# Mark the table region for redaction; see https://github.com/pymupdf/PyMuPDF/issues/698
page.add_redact_annot(rect)
# You can apply multiple redactions, so multiple tables per page should work
page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

# Print the remaining (non-table) text
for page in doc:
    print(page.get_text())

Pymupdf is a fantastic library. The only reason why I don't have pymupdf in the main lib is the license issue.

So those two are the current options for getting text outside of tables. When turning the table into content for RAG, I recommend converting the dataframe to markdown.

In general, I find gpt models' performance on tables to be as follows:

markdown ~ latex > html > csv-plus >> tsv ~ csv >> native pdf formatting (space-separated)

csv-plus: a slight modification of csv, where an extra space follows each comma.
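To make the two formats concrete, here is a small dependency-free sketch that renders the same rows as markdown and as csv-plus. Both function names are made up for illustration, and the csv-plus writer is deliberately naive (no quoting of cells that contain commas). With a pandas DataFrame, `DataFrame.to_markdown()` (which needs the `tabulate` package) is the more usual route to markdown.

```python
from typing import List, Sequence


def to_markdown(header: Sequence[object], rows: List[Sequence[object]]) -> str:
    """Render rows as a pipe-delimited markdown table."""
    def line(cells: Sequence[object]) -> str:
        # Escape literal pipes so they do not break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in rows)])


def to_csv_plus(header: Sequence[object], rows: List[Sequence[object]]) -> str:
    """Render rows as csv-plus: plain csv with an extra space after each comma.

    Naive on purpose: cells containing commas are not quoted.
    """
    return "\n".join(", ".join(str(c) for c in row) for row in [header, *rows])


header = ["name", "score"]
rows = [["a", 1], ["b", 2]]
print(to_markdown(header, rows))
print(to_csv_plus(header, rows))
```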

After getting the text inside and outside the tables, I simply concatenate them. Placing each table at the correct location in the document flow is probably possible with some effort, but unfortunately I don't have an example.
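One rough way to approximate document flow is to sort text blocks and tables by page and then by vertical position before joining. This is an untested idea rather than something gmft provides: `merge_in_reading_order` and the `(page_no, y0, content)` tuples are assumptions for illustration, and the sort only makes sense for single-column layouts where y grows downward.

```python
from typing import List, Tuple

Item = Tuple[int, float, str]  # (page_no, y0 of the top edge, content)


def merge_in_reading_order(text_blocks: List[Item], tables: List[Item]) -> str:
    """Hypothetical helper: interleave text and tables by page, then top edge."""
    items = sorted(text_blocks + tables, key=lambda item: (item[0], item[1]))
    return "\n\n".join(content for _, _, content in items)


# Made-up positions: one table sits between two paragraphs on page 0.
text_blocks = [
    (0, 50.0, "Intro paragraph."),
    (0, 400.0, "Discussion."),
    (1, 60.0, "Appendix."),
]
tables = [(0, 200.0, "| a | b |\n| --- | --- |\n| 1 | 2 |")]
print(merge_in_reading_order(text_blocks, tables))
```

Multi-column pages would need a smarter ordering key than `(page_no, y0)`.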

conjuncts commented 3 weeks ago

Hello. I finally wrote some prototypical code that does this. The pymupdf path is definitely higher quality but requires you to abide by the stricter AGPL license. https://github.com/conjuncts/gmft/commit/795f229c29b411cbe4b39e5209307880d35592a7

conjuncts / gmft

Is there a way to parse the whole pdf and the tables alone with gmft #12