camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.91k stars 462 forks

Is it possible to extract only tables from pdf using camelot? Used flavor as stream, but still getting text paragraphs as tables #350

Open jainamshah535 opened 1 year ago

jainamshah535 commented 1 year ago

import camelot
import pandas as pd

path1 = r"C:\Users\Downloads\PDF Extraction Project\Compensation_document.pdf"
tables = camelot.read_pdf(path1, flavor='stream', pages='all')

print("Total tables extracted:", tables.n)

# Write each extracted table to its own sheet; the context manager saves the file on exit.
with pd.ExcelWriter(r"c:\temp\Compensation_document.xlsx") as writer:
    for i in range(tables.n):
        print("-----------------------------------------", i)
        df2 = tables[i].df
        df2.to_excel(writer, sheet_name="Sheet" + str(i + 1), index=False)
        print(df2)

For this code I tried setting options such as edge_tol=500, row_tol=10, column_tol=10, etc., but text paragraphs are still detected as tables. Also, how can I automatically specify the table regions in a PDF and get their coordinates automatically in code?
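One possible way to drop paragraph-like results after extraction is to filter on each table's parsing report. The sketch below is only a heuristic, not a confirmed fix for this issue: it assumes report dicts shaped like camelot's `Table.parsing_report` (keys `accuracy`, `whitespace`, `order`, `page`). The helper names, the thresholds, and the sample values are all made up for illustration.

```python
def is_probable_table(report, min_accuracy=90.0, max_whitespace=40.0):
    """Return True if a parsing report looks like a real table.

    `report` is a dict shaped like camelot's `Table.parsing_report`.
    The default thresholds are illustrative assumptions, not recommended values.
    """
    return (
        report["accuracy"] >= min_accuracy
        and report["whitespace"] <= max_whitespace
    )


def filter_reports(reports, **thresholds):
    """Return the indices of the reports that pass is_probable_table."""
    return [
        i for i, report in enumerate(reports)
        if is_probable_table(report, **thresholds)
    ]


# Hypothetical reports, invented for this example (not output from a real PDF).
sample_reports = [
    {"accuracy": 99.1, "whitespace": 10.5, "order": 1, "page": 1},
    {"accuracy": 62.3, "whitespace": 71.4, "order": 2, "page": 1},
]
kept = filter_reports(sample_reports)
print(kept)
```

With camelot installed, the same check could be applied as `[t for t in tables if is_probable_table(t.parsing_report)]`. The thresholds would need tuning per document, and a paragraph that happens to parse cleanly may still pass.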

siddarthvader commented 1 year ago

Did you find a workaround? I am also trying to extract tables and the remaining text using camelot, but with no success. @jainamshah535

bendlev commented 1 year ago

^^ Echoing the same here: it seems like the table_area= argument is not limiting Camelot's scope to the specified area.
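For reference, the keyword documented by camelot is `table_areas` (plural). It takes a list of strings of the form "x1,y1,x2,y2", where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space, with the origin at the bottom-left of the page. The sketch below only builds such a string. The helper name and the coordinates are hypothetical, and it is not a confirmed resolution of the behaviour described above.

```python
def to_table_area(x1, y1, x2, y2):
    """Format a box as the "x1,y1,x2,y2" string used by `table_areas`.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner.
    PDF coordinates start at the bottom-left of the page, so y1 must be
    larger than y2.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected top-left (x1, y1) and bottom-right (x2, y2)")
    return f"{x1},{y1},{x2},{y2}"


# Hypothetical coordinates, chosen only for illustration.
area = to_table_area(50, 700, 550, 400)
print(area)

# With camelot installed, the string would be passed inside a list, e.g.:
# tables = camelot.read_pdf(path1, flavor="stream", pages="1", table_areas=[area])
```

To find real coordinates, camelot's visual debugging helper, `camelot.plot(tables[0], kind='text')`, can plot the text on a page so the corners can be read off the axes.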