camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
3.04k stars 474 forks

Have .read_pdf() show us which page it is processing for large PDF files. #507

Open bulrush15 opened 2 months ago

bulrush15 commented 2 months ago

I may have PDF files of 400 or more pages, each page with a table. It would help to have an option in .read_pdf() that makes Camelot report which page it is starting to process, or which page it has just processed.

Alternatively, how can we write a loop that processes one page at a time, so that I can print my own message showing which page is being processed?

bosd commented 2 months ago

Hey!

As discussed in #343, we are trying to build a maintained fork at pypdf_table_extraction.

This specific feature is not implemented. However, there is support for parallel processing to speed up extraction from large files, which you may find useful.

bulrush15 commented 2 months ago

Thank you @bosd. But we may end up processing many large files, so in my status message I would still want to show the file I'm processing and the page that is being processed.

I may be able to process multiple pages in a loop like this:

# From Gemini AI.
import camelot

# Replace 'your_pdf_file.pdf' with the actual path to your PDF file
pdf_file = 'your_pdf_file.pdf'

# Extract tables from every page (read_pdf only reads page 1 by default)
tables = camelot.read_pdf(pdf_file, pages='all')

# Iterate through the extracted tables
for i, table in enumerate(tables, start=1):
    # Convert the table to a pandas DataFrame
    df = table.df

    # Save each DataFrame to its own UTF-8 CSV file so that
    # later tables do not overwrite earlier ones
    csv_file = f'output_page{table.page}_table{i}.csv'
    df.to_csv(csv_file, index=False, encoding='utf-8')

    print(f"Table {i} (page {table.page}) saved as {csv_file}")
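
The loop above still hands the whole file to Camelot in one call, so no message can be printed between pages. A minimal sketch of the page-at-a-time approach asked about earlier might look like the following. The helper names `process_pdf` and `run_with_camelot` are hypothetical, and the sketch assumes that the pypdf package is installed to count pages (older Camelot installs may ship PyPDF2 instead). The reader function is passed in as an argument so the progress logic can be tried without a real PDF.

```python
def process_pdf(pdf_file, page_numbers, read_pdf, report=print):
    """Read one page at a time, reporting progress before each page.

    `read_pdf` is any callable with the shape read_pdf(path, pages="N");
    camelot.read_pdf fits. `report` receives the status message.
    """
    results = {}
    total = len(page_numbers)
    for done, page in enumerate(page_numbers, start=1):
        report(f"{pdf_file}: page {page} ({done} of {total})")
        # Camelot expects `pages` as a string such as "1" or "1,3-5"
        results[page] = read_pdf(pdf_file, pages=str(page))
    return results


def run_with_camelot(pdf_file):
    """Hypothetical wrapper that wires the loop to Camelot."""
    # Imported lazily so process_pdf can be exercised without Camelot installed.
    import camelot
    from pypdf import PdfReader  # assumption: pypdf is available

    page_count = len(PdfReader(pdf_file).pages)
    return process_pdf(pdf_file, range(1, page_count + 1), camelot.read_pdf)
```

Calling `run_with_camelot('your_pdf_file.pdf')` would print one status line per page and return a dict that maps each page number to the tables found on that page. Reading page by page means one read_pdf call per page, which likely adds some overhead compared with a single pages='all' call, so this trades speed for visibility.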