boncey / Flickr4Java

Java API For Flickr. Fork of FlickrJ
BSD 2-Clause "Simplified" License
176 stars, 155 forks

How to extract all tables from pdf, also with headers and footers of tables at the same time? #558

Closed mmariyam closed 2 years ago

mmariyam commented 2 years ago

How to extract all tables from pdf, also with headers and footers of tables at the same time?

How can I use tabula to extract all tables from a PDF together with the text that comes before and after each table (for example, notes on the table), and write everything to one Excel file with each table on a separate sheet?

This code only extracts the tables to separate files.

    pip3 install camelot-py[cv] tabula-py

    import os
    import tabula

    tables = tabula.read_pdf("foo.pdf", pages="all")

    # save them in a folder
    folder_name = "tables"
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)

    # iterate over extracted tables and export as excel individually
    for i, table in enumerate(tables, start=1):
        table.to_excel(os.path.join(folder_name, f"table_{i}.xlsx"), index=False)

    import tabula

    pdf_path = "foo.pdf"

    dfs = tabula.read_pdf(pdf_path, pages='all')

    print(len(dfs))

    # write each extracted table to its own CSV file
    for i in range(len(dfs)):
        dfs[i].to_csv(f"table_{i}.csv")
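For the "one Excel file, one sheet per table" part of the question, something along these lines might work. It is only a sketch: `sheet_name` and `export_tables_to_workbook` are hypothetical helper names, `tables.xlsx` is an assumed output path, and it assumes pandas with openpyxl is installed alongside tabula-py. It does not capture the text before and after each table, since `tabula.read_pdf` only returns the tables themselves.

```python
def sheet_name(index, prefix="table"):
    """Build a worksheet name such as 'table_1'.

    Excel limits sheet names to 31 characters, so the prefix is
    shortened if needed while the numeric suffix is kept intact.
    """
    suffix = f"_{index}"
    return prefix[: 31 - len(suffix)] + suffix


def export_tables_to_workbook(pdf_path, xlsx_path):
    """Write every table tabula finds in pdf_path to its own sheet of xlsx_path.

    Returns the number of sheets written.
    """
    # Imported here so sheet_name() stays usable without the PDF/Excel stack.
    import pandas as pd
    import tabula

    # With the default settings read_pdf returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all")
    if not tables:
        # A workbook needs at least one sheet, so stop early.
        raise ValueError(f"no tables found in {pdf_path}")

    # One writer for the whole workbook; each to_excel call adds a sheet.
    with pd.ExcelWriter(xlsx_path) as writer:
        for i, table in enumerate(tables, start=1):
            table.to_excel(writer, sheet_name=sheet_name(i), index=False)
    return len(tables)
```

Calling `export_tables_to_workbook("foo.pdf", "tables.xlsx")` would then produce a single workbook with sheets `table_1`, `table_2`, and so on, instead of one file per table.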