Open igvk opened 1 year ago
Thanks for providing an example. It will be very hard to extract that table. Don't expect an improvement in the foreseeable future.
Hello @igvk,
I tweaked some of Camelot's parameters and I am getting good results. Check out the attached CSV and screenshot.
Code I used to achieve this:
!wget https://github.com/camelot-dev/camelot/files/12279247/V_1.pdf
import camelot
tables = camelot.read_pdf('/kaggle/working/V_1.pdf', flavor='stream', row_tol=30)
print(tables)
df = tables[0].df
# Replace newline characters with spaces to make the output look clean
df = df.replace('\n', ' ', regex=True)
df
# df.to_csv('V_1.csv', index=False)
Let me know if it works for you!
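For readers wondering why a larger row_tol helps here, the following is a minimal sketch of the idea of grouping text by vertical tolerance. It is not Camelot's actual implementation; the function name `group_rows` and the y-coordinates are made up for illustration.

```python
def group_rows(items, row_tol):
    """Group (y, text) items into rows.

    An item joins the current row when its y-coordinate is within
    row_tol of the first item of that row. This is an illustrative
    simplification, not Camelot's own algorithm.
    """
    rows = []
    # PDF y-coordinates grow upward, so sort from top to bottom.
    for y, text in sorted(items, key=lambda it: -it[0]):
        if rows and abs(rows[-1][0] - y) <= row_tol:
            rows[-1][1].append(text)
        else:
            rows.append((y, [text]))
    return [" ".join(texts) for _, texts in rows]


# Hypothetical positions: a two-line cell followed by the next record.
items = [(700, "Multi-line"), (688, "description"), (650, "Next"), (640, "record")]

tight = group_rows(items, row_tol=2)   # every text line becomes its own row
loose = group_rows(items, row_tol=30)  # wrapped lines are merged into one row
print(tight)
print(loose)
```

With a small tolerance each wrapped line ends up in a separate, sparse row; with a larger one the lines of a multi-line cell are merged, which matches the effect seen with row_tol=30 above.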
@kdshreyas, yes, providing a row tolerance makes it work better, though some problems are hard to solve without providing exact column positions. In this file, that is the number 50 in column 4. Besides, the table header is shifted onto the same line as the headers of columns 3 and 4. The only solution that works well enough for me for this table is to provide the list of column positions.
@igvk, I see. Providing column positions mostly works, but it makes the extraction hard-coded. Thanks for your feedback!
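To illustrate why explicit column positions fix a value like the 50 that drifts into the neighbouring cell, here is a small sketch of assigning words to cells by x-coordinate. It only demonstrates the principle and is not how Camelot does it internally; `split_into_columns` and all coordinates are hypothetical.

```python
from bisect import bisect_right


def split_into_columns(words, separators):
    """Place (x, text) words into len(separators) + 1 cells by x position.

    `separators` must be sorted in ascending order.
    """
    cells = [[] for _ in range(len(separators) + 1)]
    for x, text in sorted(words):
        # bisect_right counts how many separators lie at or left of x,
        # which is exactly the index of the cell the word belongs to.
        cells[bisect_right(separators, x)].append(text)
    return [" ".join(cell) for cell in cells]


# Hypothetical separators and one row of words with their left x-coordinates.
separators = [100, 200, 300, 400]
row = [(20, "Item"), (120, "A"), (230, "B"), (385, "50"), (410, "text")]

cells = split_into_columns(row, separators)
print(cells)
```

Because the boundary between columns 4 and 5 is fixed at x=400, the "50" at x=385 stays in column 4 even though it sits very close to the text of column 5.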
I'm using PyMuPDF to find the column positions before calling camelot.read_pdf.
I hope the following will help!
Some helper functions:
import fitz  # PyMuPDF

def fitz_extract_tables(self, pages=['all']):
    """ Returns an array of TableFinder objects
    Inputs:
    - pages: list of zero-based page indices to process, or ['all']
    Returns:
    - an array of TableFinder objects, one per processed page
    """
    result = []
    # Get the number of pages
    #num_pages = pages if pages != 'all' else self.fitz_doc.page_count
    # Open the PDF file using PyMuPDF
    with fitz.open(self.filename) as fitz_doc:
        num_pages = fitz_doc.page_count
        for p in range(0, num_pages):
            # Skip page if not in the list
            if pages != ['all']:
                if p not in pages:
                    continue
            page = fitz_doc[p]
            # Find the tables on the page (alternative text-based strategy kept for reference)
            #result = page.find_tables(vertical_strategy='text',horizontal_strategy='text', min_words_vertical=3, min_words_horizontal=2)
            result.append(page.find_tables())
    return result
import ast

def fitz_get_table_inner_columns(self, pageArray, tableId, to_integer=False):
    """ Returns an array of coordinates of the inner columns of a table
    Inputs:
    - pageArray: list of zero-based page indices to process
    - tableId: the 1-based id of the table on the first processed page
    - to_integer: round the coordinates to integers if True
    Returns:
    - an array of floats or integers
    """
    # Fetch the tables on the given pages
    tables = self.fitz_extract_tables(pageArray)
    # fitz_extract_tables returns a list of TableFinder objects, one per page;
    # our table is on the first processed page
    table = tables[0][tableId-1]
    df = table.to_pandas()
    # Find the inner columns from the cell coordinates of the first table row
    columns_left = []
    columns_right = []
    r = 0
    for c in range(0, len(table.rows[r].cells)):
        if table.rows[r].cells[c] is not None:
            # Convert the string to a list using ast.literal_eval()
            bbox = ast.literal_eval(str(table.rows[r].cells[c]))
            # bbox contains [x0, y0, x1, y1]
            # we want the x coordinates
            columns_left.append(bbox[0])
            columns_right.append(bbox[2])
    if to_integer:
        columns_left = [int(round(x, 0)) for x in columns_left]
        columns_right = [int(round(x, 0)) for x in columns_right]
    # NOTE: the posted snippet ends here without a return statement.
    # Returning the left edges of columns 2..n as the inner separators
    # is an assumption, not necessarily what the original code did.
    return columns_left[1:]
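And then:
# Column separators of table 2 on the first page, rounded to integers
columns_array = self.fitz_get_table_inner_columns([0], 2, to_integer=True)
# Camelot expects a list of comma-separated coordinate strings; the string is
# repeated 10 times, presumably so that each detected table gets the same separators
columns = [','.join(str(x) for x in columns_array)] * 10
edge_tol = 70
self.tables = camelot.read_pdf(self.filename, flavor=self.flavor, pages=pages, columns=columns, edge_tol=edge_tol, split_text=True)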
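As a self-contained variation on the helper above, the sketch below derives separators from a row of cell bounding boxes and formats them as the comma-separated string that the `columns` argument of camelot.read_pdf takes. Unlike the posted code, it uses the midpoint between neighbouring cells; the function name `columns_argument` and the bounding boxes are hypothetical.

```python
def columns_argument(bboxes, to_integer=False):
    """Build a comma-separated separator string from one row of cell bboxes.

    Each bbox is (x0, y0, x1, y1); None entries (missing cells) are skipped.
    A separator is placed midway between the right edge of one cell and
    the left edge of the next, so only inner boundaries are returned.
    """
    cells = sorted(b for b in bboxes if b is not None)
    seps = [(left[2] + right[0]) / 2 for left, right in zip(cells, cells[1:])]
    if to_integer:
        seps = [int(round(x)) for x in seps]
    return ",".join(str(x) for x in seps)


# Hypothetical first-row cells of a three-column table, with one missing cell.
bboxes = [
    (50.0, 700.0, 120.4, 715.0),
    None,
    (120.4, 700.0, 260.0, 715.0),
    (260.0, 700.0, 400.7, 715.0),
]

arg = columns_argument(bboxes, to_integer=True)
print(arg)
```

The resulting string would then be passed as `columns=[arg]` together with flavor='stream'. Whether midpoints or cell edges work better likely depends on the PDF.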
Here is an example of a PDF that has some incorrectly extracted data (in stream mode): V_1.pdf
Multi-line text isn't interpreted as such, and as a result it is very sparsely distributed into rows.
The number 50 in column 4 of the last row is moved to the next cell to the right and merged with it.
Is it possible to improve the extraction of this table?