Open bulrush15 opened 2 months ago
Hey!
As noted in #343, we are trying to build a maintained fork at pypdf_table_extraction.
This specific feature is not implemented. However, there is support for parallel processing to speed up extraction from large files, which you may find useful.
Thank you @bosd. But we may end up processing many large files, so in my status message I would still want to show the file I'm processing and the page that is being processed.
I may be able to process multiple pages in a loop like this:
# From Gemini AI.
import camelot

# Replace 'your_pdf_file.pdf' with the actual path to your PDF file
pdf_file = 'your_pdf_file.pdf'
# Extract tables from the PDF file (by default only page 1 is read)
tables = camelot.read_pdf(pdf_file)
# Iterate through the extracted tables, numbering them from 1
for i, table in enumerate(tables, start=1):
    # Convert the table to a pandas DataFrame
    df = table.df
    # Save each DataFrame to its own UTF-8 CSV file so earlier tables are not overwritten
    csv_file = f'output_{i}.csv'
    df.to_csv(csv_file, index=False, encoding='utf-8')
    print(f"Table {i} saved as {csv_file}")
I may have PDF files of 400 or more pages, each page with a table. We could use an option in .read_pdf() where Camelot tells us which page it is starting to process, or which page it has just processed. Alternatively, how can we write a loop that processes one page at a time, so I can print my own message showing which page is being processed?
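One possible way to get a per-page status message today is to call read_pdf() once per page and print before each call. The sketch below is only an illustration, not something Camelot ships: iter_pages, camelot_read_page, count_pages and process_pdf are hypothetical helper names, and using pypdf to get the page count is an assumption (any PDF library that reports the number of pages would do). The only Camelot feature relied on is the existing pages argument of read_pdf(), which accepts a page number as a string.

```python
from typing import Callable, Iterable, Iterator, Tuple


def iter_pages(
    pdf_file: str,
    page_numbers: Iterable[int],
    read_page: Callable[[str, int], object],
    report: Callable[[str], None] = print,
) -> Iterator[Tuple[int, object]]:
    """Yield (page_number, tables) one page at a time, reporting progress first."""
    page_numbers = list(page_numbers)
    total = len(page_numbers)
    for done, page in enumerate(page_numbers, start=1):
        # Status message with the file name and the page about to be processed.
        report(f"{pdf_file}: processing page {page} ({done}/{total})")
        yield page, read_page(pdf_file, page)


def camelot_read_page(pdf_file: str, page: int):
    """Read a single page with Camelot; `pages` takes a string such as '3'."""
    import camelot  # imported lazily so iter_pages can be used without Camelot

    return camelot.read_pdf(pdf_file, pages=str(page))


def count_pages(pdf_file: str) -> int:
    """Page count via pypdf (assumption: pypdf is installed)."""
    from pypdf import PdfReader

    return len(PdfReader(pdf_file).pages)


def process_pdf(pdf_file: str) -> None:
    """Save every table to its own CSV, printing a status line per page."""
    pages = range(1, count_pages(pdf_file) + 1)
    for page, tables in iter_pages(pdf_file, pages, camelot_read_page):
        for order, table in enumerate(tables, start=1):
            csv_file = f"page_{page}_table_{order}.csv"
            table.df.to_csv(csv_file, index=False, encoding="utf-8")
```

Calling process_pdf('your_pdf_file.pdf') would then print one line per page before that page is parsed. The trade-off is that every read_pdf() call likely has to open the file again, so on a 400-page PDF this will probably be slower than a single call with pages='all'; passing a different report callable (for example a logger method) is a way to redirect the messages.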