camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

pdf extraction has an issue while copying texts among vertically spanned cells #349

Open saidakyuz opened 1 year ago

saidakyuz commented 1 year ago

Describe the bug

I am extracting data from PDFs using camelot and ran into the following issue on page 3 of this datasheet. The problematic table is shown below:

The issue is that the content of spanning cells is copied inconsistently. As you can see in the following picture, the spanning cells are detected correctly.

Even though the cells are detected correctly, in the 3rd column the content is copied to only one of the two spanned cells, and in the 4th column it is copied to only two of the three spanned cells. You can see the extracted data below. In each of these two columns, exactly one cell is always missing.

Steps to reproduce the bug

# Notebook setup (Colab-style shell commands)
!pip install "camelot-py[cv]" -q
!pip install PyPDF2==2.12.1
!apt-get install -y ghostscript

import re

import camelot
import fitz  # provided by the PyMuPDF package
import pandas as pd
from tabulate import tabulate

Expected behavior

Code

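# Restrict detection to this region so the page borders are ignored.
table_areas = ['86, 697, 529, 95']
# single_source holds the path to the PDF being processed.
tables = camelot.read_pdf(single_source, pages='all',
                          flavor='lattice',
                          copy_text=['v'],  # copy text across vertically spanning cells
                          line_scale=110,
                          table_regions=table_areas,
                          flag_size=False,
                          process_background=False)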

To visualize the tables:

for table in tables:
    # Parsing report, (rows, columns) shape, and bounding box of each table
    print(table.parsing_report, table.shape, table._bbox)
    print(tabulate(table.df, headers='keys', tablefmt='psql'))
    camelot.plot(table, kind='grid').show()

print("Extracting", single_source, "is finished!")

PDF HERE

Screenshots

image image image

Environment

Additional context

saidakyuz commented 1 year ago

From what I have observed, this happens only when some cells span both vertically and horizontally and the same column also contains cells that are not spanned horizontally. Somehow, cells in the same row can end up with opposite values of vspan (True or False). The issue is caused by this attribute, but I still have no solution for it.
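One way to check this observation is to dump the span flags stored on each cell. The sketch below assumes that every entry of table.cells exposes boolean `vspan` and `hspan` attributes, in line with the vspan attribute mentioned above. The helper name `span_report` and the stand-in class `FakeCell` are made up for illustration, so the snippet runs without a PDF.

```python
from dataclasses import dataclass


@dataclass
class FakeCell:
    """Hypothetical stand-in for a camelot cell; only the attributes read below."""
    text: str = ""
    vspan: bool = False
    hspan: bool = False


def span_report(cells):
    """Return one string per row, marking each cell's span flags.

    'V' = vspan only, 'H' = hspan only, 'B' = both, '.' = neither.
    `cells` is a list of rows, each a list of cell-like objects.
    """
    lines = []
    for row in cells:
        marks = []
        for cell in row:
            if cell.vspan and cell.hspan:
                marks.append("B")
            elif cell.vspan:
                marks.append("V")
            elif cell.hspan:
                marks.append("H")
            else:
                marks.append(".")
        lines.append(" ".join(marks))
    return lines


if __name__ == "__main__":
    # Two rows of made-up cells, not data from the datasheet in this issue.
    demo = [
        [FakeCell("a"), FakeCell("x", vspan=True), FakeCell("y", vspan=True, hspan=True)],
        [FakeCell("b"), FakeCell("", vspan=True), FakeCell("", hspan=True)],
    ]
    print("\n".join(span_report(demo)))
```

With a real extraction, something like span_report(tables[0].cells) should show whether cells that belong to the same merged region carry different vspan values.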

The following tables also have the same issue: image image
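Until the span flags behave consistently, a possible stopgap is to fill the missing cells after extraction. This is a minimal sketch, not a fix for camelot itself: the helper `fill_down` is hypothetical, works on plain lists of strings, and will also overwrite cells that are genuinely empty in the PDF, so it should only be applied to columns known to contain vertically spanned cells.

```python
def fill_down(rows, columns):
    """Copy the last non-empty value downward in the given column indexes.

    `rows` is a list of lists of strings, for example table.df.values.tolist().
    A new list is returned; the input is left untouched.
    """
    filled = [list(row) for row in rows]
    for col in columns:
        last = ""
        for row in filled:
            if row[col].strip():
                last = row[col]   # remember the most recent real value
            else:
                row[col] = last   # empty cell: reuse the value from above
    return filled


if __name__ == "__main__":
    # Made-up rows that mimic one missing cell per spanned column.
    rows = [
        ["a", "1", "x"],
        ["b", "", "x"],
        ["c", "", ""],
        ["d", "2", "y"],
    ]
    for row in fill_down(rows, [1, 2]):
        print(row)
```

For the table in this issue, the 3rd and 4th columns correspond to the zero-based indexes 2 and 3, so the call would look like fill_down(table.df.values.tolist(), [2, 3]).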