PDF sample - any way to improve extraction?

igvk commented 1 year ago

Here is an example of a PDF with some incorrectly extracted data (in stream mode): V_1.pdf

Multi-line text isn't recognized as such, so it ends up spread sparsely across many rows.
The number 50 in the last row, column 4, is moved into the next cell to the right and merged with it.

Is it possible to improve the extraction of this table?

foarsitter commented 1 year ago

Thanks for providing an example. It will be very hard to extract that table. Don't expect an improvement in the foreseeable future.

kdshreyas commented 1 year ago

Hello @igvk,

I tweaked some of Camelot's parameters and am getting good results. Check out the attached CSV and screenshot.

V_1.csv

Code I used to achieve this:

# Notebook cell (Kaggle): download the sample PDF into the working directory
!wget https://github.com/camelot-dev/camelot/files/12279247/V_1.pdf

import camelot

# A larger row_tol combines vertically close text lines into a single row
tables = camelot.read_pdf('/kaggle/working/V_1.pdf', flavor='stream', row_tol=30)
print(tables)

df = tables[0].df
# Replace newline characters with spaces to make the cells look clean
df = df.replace('\n', ' ', regex=True)

df

# df.to_csv('V_1.csv', index=False)

Let me know if it works for you!
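
To see why a larger `row_tol` helps with multi-line cells, the sketch below groups text fragments into rows by their vertical position. It is a simplified illustration of tolerance-based grouping, not Camelot's actual implementation; `group_rows` and the sample coordinates are made up for the example.

```python
def group_rows(fragments, row_tol):
    """Group (y, text) fragments into rows.

    A fragment joins the current row when its y coordinate is within
    row_tol of the first fragment of that row; otherwise it starts a new row.
    """
    rows = []
    row_y = None
    # PDF y coordinates grow upwards, so sort from top to bottom.
    for y, text in sorted(fragments, key=lambda f: -f[0]):
        if row_y is None or abs(row_y - y) > row_tol:
            rows.append([])
            row_y = y
        rows[-1].append(text)
    return rows


# Hypothetical fragments: a three-line cell next to a single-line cell.
fragments = [
    (700, "Line 1"),
    (690, "Line 2"),
    (680, "Line 3"),
    (690, "50"),
    (640, "Next row"),
]

tight = group_rows(fragments, row_tol=2)   # every text line becomes its own row
loose = group_rows(fragments, row_tol=30)  # the multi-line cell stays in one row
```

With the small tolerance the three-line cell is spread over several rows, which matches the sparse output described at the top of this thread. The trade-off is that a tolerance that is too large can also merge rows that really are separate.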

igvk commented 1 year ago

@kdshreyas, yes, providing a row tolerance makes it work better. There are still some problems that are hard to solve without providing exact column positions, though. In this file, one of them is the number 50 in column 4. Besides that, the table header is shifted onto the same line as the column headers of columns 3 & 4. The only solution that works well enough for me for this table is to provide the list of column positions.
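
For reference, a minimal sketch of passing explicit column positions to Camelot in stream mode. The `columns` argument takes a list of strings, each holding comma-separated x coordinates. The coordinates below are placeholders, not the real separators for V_1.pdf, and `columns_arg` and `read_with_columns` are hypothetical helper names.

```python
def columns_arg(x_positions):
    """Format column separator x coordinates as a 'x1,x2,...' string."""
    return ",".join(str(x) for x in sorted(x_positions))


def read_with_columns(pdf_path, x_positions, row_tol=30):
    """Run stream extraction with fixed column separators."""
    import camelot  # imported lazily so columns_arg works without Camelot

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        columns=[columns_arg(x_positions)],  # one string per table
        row_tol=row_tol,
        split_text=True,  # split text that spans a column separator
    )


# Placeholder coordinates in PDF points; measure the real ones for your file.
example = columns_arg([310, 120, 205.5])
```

Calling `read_with_columns('V_1.pdf', [...])` with measured separators would then return the usual table list. `split_text=True` may help with values such as the 50 that get merged into a neighbouring cell, though that is not verified against this file.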

kdshreyas commented 1 year ago

@igvk, I see. Providing the columns parameter mostly works, but it makes the solution hard-coded! Thanks for your feedback!

patlachance commented 7 months ago

I'm using PyMuPDF to find out the column positions before calling camelot.read_pdf.

I hope the following will help!

Some helper functions

    def fitz_extract_tables(self, pages=None):
        """ Returns one TableFinder object per processed page
        Inputs:
            - pages: list of 0-based page indexes to process; None or ['all'] processes every page

        Returns:
            - a list of TableFinder objects
        """
        result = []
        process_all = pages is None or list(pages) == ['all']
        # Open the PDF file using PyMuPDF (requires `import fitz`)
        with fitz.open(self.filename) as fitz_doc:
            for p in range(fitz_doc.page_count):
                # Skip the page if it is not in the list
                if not process_all and p not in pages:
                    continue

                page = fitz_doc[p]
                # Detect the tables on the page; a text-based strategy is also possible:
                #result = page.find_tables(vertical_strategy='text',horizontal_strategy='text', min_words_vertical=3, min_words_horizontal=2)
                result.append(page.find_tables())
        return result

    def fitz_get_table_inner_columns(self, pageArray, tableId, to_integer=False):
        """ Returns the x coordinates of the inner columns of a table
        Inputs:
            - pageArray: list of 0-based page indexes to process; the table is looked up on the first of them
            - tableId: the 1-based id of the table on that page
            - to_integer: round the coordinates to integers

        Returns:
            - an array of floats or integers
        """
        result = []
        # Fetch the tables on the given pages
        tables = self.fitz_extract_tables(pageArray)

        # fitz_extract_tables returns one TableFinder object per processed page;
        # our table is on the first processed page
        table = tables[0][tableId-1]

        # Find the inner columns from the cell coordinates of the first table row
        columns_left  = []
        columns_right = []
        r = 0
        for cell in table.rows[r].cells:
            if cell is not None:
                # Each cell is a bbox (x0, y0, x1, y1); we want the x coordinates
                x0, _, x1, _ = cell
                columns_left.append(x0)
                columns_right.append(x1)

        if to_integer:
            columns_left = [int(round(x)) for x in columns_left]
            columns_right = [int(round(x)) for x in columns_right]

        # The snippet as posted has no return statement. Assumption: the inner
        # separators are the left edges of every column except the first.
        result = columns_left[1:]
        return result

And then

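        columns_array = self.fitz_get_table_inner_columns([0], 2, to_integer=True)
        # Camelot takes one comma-separated string of x coordinates per table;
        # repeating it 10 times applies the same columns to up to 10 tables
        columns = [','.join(str(x) for x in columns_array)] * 10
        edge_tol = 70
        self.tables = camelot.read_pdf(self.filename, flavor=self.flavor, pages=pages, columns=columns, edge_tol=edge_tol, split_text=True)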
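
The column detection step can also be sketched without the surrounding class. The function below takes the cell bounding boxes of one table row, as `(x0, y0, x1, y1)` tuples with `None` for missing cells, and computes the separators between columns. Placing each separator midway between one cell's right edge and the next cell's left edge is an assumption made for this sketch, not something taken from the helper above, and `inner_separators` and the sample boxes are hypothetical.

```python
def inner_separators(cells, to_integer=False):
    """Return the x coordinates that separate neighbouring cells of a row."""
    # Drop missing cells and order the rest from left to right.
    boxes = sorted((c for c in cells if c is not None), key=lambda b: b[0])
    # Midpoint between the right edge of a cell and the left edge of the next.
    seps = [(left[2] + right[0]) / 2 for left, right in zip(boxes, boxes[1:])]
    if to_integer:
        seps = [int(round(x)) for x in seps]
    return seps


# Hypothetical first row of a three-column table, with one missing cell.
row = [
    (50.0, 100.0, 120.4, 115.0),
    None,
    (121.0, 100.0, 200.0, 115.0),
    (204.0, 100.0, 310.0, 115.0),
]

seps = inner_separators(row, to_integer=True)
# Same comma-separated format as the `columns` argument built above.
columns = [",".join(str(x) for x in seps)]
```

A row with n cells yields n - 1 separators, and the outer table edges are left out on purpose, since only the boundaries between columns are needed.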

camelot-dev / camelot

PDF sample - any way to improve extraction? #393