atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io
Other
3.67k stars, 360 forks

Enhance Camelot's Table Extraction to Exclude Specific Rows Based on Alignment Issues #504

Open iammkullah opened 9 months ago

iammkullah commented 9 months ago

I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems during the extraction process, primarily due to their alignment. These rows often differ in format from the rest of the table, affecting the consistency and accuracy of the extracted data. Currently, Camelot does not seem to offer a direct way to exclude specific rows based on their characteristics or alignment.

This feature would be incredibly beneficial for scenarios where table headers or footers consistently deviate in style or alignment from the main table body, leading to extraction inaccuracies. A parameter or method to specify rows to ignore (by index or pattern recognition) during extraction could significantly improve the utility and flexibility of Camelot for users facing similar challenges.

Is there an existing solution or workaround to address this issue, or could this functionality be considered for future updates?

For details, see the screenshots: page 1 [image] and page 2 [image] (the table is long and spans pages 2, 3, and 4 in some PDFs).

As you can see, the first and last rows cause Camelot to produce 11 columns for this DataFrame when there are actually 10. My PDFs sometimes have a footer (the last row of the table on the page) and an extra row above the header, neither of which I want extracted; my actual header comes after that first row.

I have already tried adjusting line_tol, joint_tol, split_text, line_scale, shift_text, etc. This works for smaller misalignments, as in the first screenshot (page 1), but it fails for the second screenshot.

Here is my function for appending tables, which builds a single result_df for long tables:

```python
import pandas as pd


def append_tables_to_dataframe(tables):
    try:
        df_list = []

        for i, table in enumerate(tables):
            # Only keep tables that have at least 10 columns
            if table.shape[1] >= 10:
                # Handle header extraction for the first table
                if i == 0:
                    # Find the rows whose first cell contains "Date of Transaction"
                    # (the pattern tolerates stray whitespace between letters)
                    date_index = table.df[table.df.iloc[:, 0].str.contains(
                        r"\bD\s*a\s*t\s*e\s*o\s*f\s*T\s*r\s*a\s*n\s*s\s*a\s*c\s*t\s*i\s*o\s*n\b",
                        case=False, regex=True)].index

                    if not date_index.empty:
                        print("Got the row having header ...")
                        # header_index = date_index[0]
                        table.df.columns = range(len(table.df.columns))

                df_list.append(table.df)

        # Concatenate all collected tables into one DataFrame
        result_df = pd.concat(df_list, ignore_index=True)

        return result_df

    except Exception as e:
        print("Error in result_df creation:", e)
        return pd.DataFrame()  # Return an empty DataFrame in case of an error
```

Is there any way to match on the text that appears, or on anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the data we are interested in)?

I hope this makes sense; if not, let me know and I will gladly explain more. If you are able to add this to Camelot, it would make the library more powerful.

Thanks

bosd commented 3 months ago

Hey all!

We are trying to build a maintained fork at pypdf_table_extraction.

You are welcome to check it out and contribute there. @iammkullah Can you open an issue there (if the problem still exists)?

rodfloripa commented 3 months ago

I have the same problem. @iammkullah did you find any other solution?

iammkullah commented 3 months ago

@rodfloripa I haven't found a solution, so I handled all of this in post-processing of the data.
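For anyone landing here with the same problem, this is one way such post-processing could look. It is only a sketch, not the code actually used above: `trim_to_body`, `n_footer_rows`, and the sample rows are hypothetical, and it assumes that the spurious extra column holds only empty strings once the misaligned first and last rows are gone.

```python
import re

import pandas as pd

# Matches "Date of Transaction" even when extraction leaves stray spaces
# between letters (same idea as the pattern in the original function).
HEADER_PATTERN = re.compile(r"\s*".join("DateofTransaction"), re.IGNORECASE)


def trim_to_body(df, header_pattern=HEADER_PATTERN, n_footer_rows=0):
    """Drop rows above the header row, trailing footer rows, and empty columns.

    Hypothetical helper: `n_footer_rows` is how many trailing rows to discard.
    """
    first_col = df.iloc[:, 0].astype(str)
    is_header = first_col.map(lambda s: bool(header_pattern.search(s))).to_numpy()

    # Keep everything from the first header match onwards (if there is one).
    if is_header.any():
        df = df.iloc[int(is_header.argmax()):]

    # Discard the requested number of footer rows.
    if n_footer_rows > 0:
        df = df.iloc[: max(len(df) - n_footer_rows, 0)]

    # Assumption: a column that is blank in every remaining row only existed
    # because of the misaligned rows that were just removed.
    keep = [
        bool((df.iloc[:, i].astype(str).str.strip() != "").any())
        for i in range(df.shape[1])
    ]
    df = df.iloc[:, keep].reset_index(drop=True)
    df.columns = range(df.shape[1])
    return df


# Made-up sample: column 2 is populated only by the title and footer rows.
raw = pd.DataFrame(
    [
        ["Some title", "", "X", ""],
        ["Date of Transaction", "Details", "", "Amount"],
        ["01/01/2024", "Coffee", "", "3.50"],
        ["02/01/2024", "Books", "", "12.00"],
        ["Page 1 of 3", "", "footer", ""],
    ]
)
body = trim_to_body(raw, n_footer_rows=1)
```

Applying this to each `table.df` before concatenating keeps the column count consistent across pages, provided the header text and footer size are predictable in your PDFs.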

rodfloripa commented 3 months ago

Can you open an issue on https://github.com/py-pdf/pypdf_table_extraction?

bosd commented 3 months ago

> Is there any way to match on the text that appears, or on anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the data we are interested in)?

Have you tried setting table regions? Or table areas?
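A minimal sketch of the table-areas suggestion, assuming the body of the table sits at roughly the same position on every page. Camelot's `read_pdf` accepts `table_areas` as a list of `"x1,y1,x2,y2"` strings, where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left). The helper names and the coordinates below are hypothetical; you would need to measure real values from your own PDF, for example with Camelot's visual debugging plots.

```python
def area_string(x1, y_top, x2, y_bottom):
    """Format a Camelot table area: top-left (x1, y_top), bottom-right (x2, y_bottom).

    PDF coordinates grow upward, so the top edge has the larger y value.
    """
    if not (x1 < x2 and y_bottom < y_top):
        raise ValueError("expected x1 < x2 and y_bottom < y_top")
    return f"{x1},{y_top},{x2},{y_bottom}"


def read_table_body(path, pages, area, flavor="lattice"):
    """Extract only the given area so rows outside it are never parsed.

    Hypothetical wrapper; camelot is imported lazily so that area_string
    can be used and tested without the library installed.
    """
    import camelot

    return camelot.read_pdf(path, pages=pages, flavor=flavor, table_areas=[area])


# Made-up coordinates that would skip a title row at the top of the page
# and a footer row at the bottom.
body_area = area_string(50, 700, 560, 90)
# tables = read_table_body("statement.pdf", pages="2-4", area=body_area)
```

The same area applies to every page in `pages`, so if the first page is laid out differently from the continuation pages, it is probably simplest to call `read_table_body` once per layout with its own area.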