conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License

The header is not included as a row. Consider adding it back as a row #24

Open wassim opened 4 weeks ago

wassim commented 4 weeks ago

On large tables, the header is skipped. Is there a way to disable this behaviour? If not, how can I add the header back, please?

Invoking large table row guess! set TATRFormatConfig.force_large_table_assumption to False to disable this.
The header is not included as a row. Consider adding it back as a row.

from gmft import CroppedTable, TableDetector, AutoTableFormatter, AutoFormatConfig
from gmft.pdf_bindings import PyPDFium2Document
import pandas as pd
import sys
import io

detector = TableDetector()

config = AutoFormatConfig()
config.enable_multi_header = True
config.semantic_spanning_cells = True

formatter = AutoTableFormatter(config=config)

def ingest_pdf(pdf_path): # produces list[CroppedTable]
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)
    return tables, doc

def extract_tables_to_csv(tables):
    output = io.StringIO()
    for table in tables:
        ft = formatter.extract(table)
        df = ft.df()

        # Write the DataFrame to CSV; column names are emitted only while the
        # buffer is still empty, i.e. for the first table written
        df.to_csv(output, index=False, header=output.tell()==0)

    return output.getvalue()

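# Ingest PDF and extract tables
# pdf_path and output_path are not defined in the original post;
# the values below are placeholders
pdf_path = "input.pdf"
output_path = "output.csv"
tables, doc = ingest_pdf(pdf_path)
csv_content = extract_tables_to_csv(tables)

with open(output_path, 'w', newline='') as f:
    f.write(csv_content)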

doc.close()

conjuncts commented 2 weeks ago

Hi, I'm trying to reproduce the bug now. Would you mind submitting an example PDF? Submission form. Thanks.

conjuncts commented 2 weeks ago

On closer examination, most of the time it is a harmless warning that doesn't change behavior. But in some cases there is a bug, so I will fix it.
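
For the "how to add the header back" part of the question, one general workaround that does not depend on gmft internals is to write each table's column names explicitly as its first row when serializing. Below is a minimal sketch using only the standard library. It assumes each table has already been reduced to a (columns, rows) pair; for a pandas DataFrame that could be list(df.columns) and df.values.tolist(). The function name tables_to_csv and the sample data are hypothetical, not part of gmft.

```python
import csv
import io


def tables_to_csv(tables):
    """Serialize (columns, rows) pairs into one CSV string.

    Each table's column names are written as an ordinary row before
    its data rows, so no header is dropped when tables are concatenated.
    """
    output = io.StringIO()
    writer = csv.writer(output, lineterminator="\n")
    for columns, rows in tables:
        writer.writerow(columns)  # the header, added back as a row
        writer.writerows(rows)    # the table body
    return output.getvalue()


# Hypothetical sample data standing in for two extracted tables
example = [
    (["name", "qty"], [["bolt", "4"], ["nut", "7"]]),
    (["city", "pop"], [["Oslo", "700000"]]),
]
csv_text = tables_to_csv(example)
print(csv_text)
```

Unlike the header=output.tell()==0 pattern in the snippet above, this keeps a header row for every table rather than only the first, which may or may not be what you want if all tables share the same columns.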