Open iammkullah opened 9 months ago
Hey all!
We are trying to build a maintained fork at pypdf_table_extraction.
You are welcome to check it out and contribute there. @iammkullah Can you open an issue there (if the problem still exists)?
I have the same problem. @iammkullah did you find any other solution?
@rodfloripa I haven't found a solution, so I handled all of this in post-processing of the extracted data.
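For anyone landing here, a minimal sketch of what that kind of post-processing could look like. It is not the code used in this thread: the helper names (`drop_matching_rows`, `drop_empty_columns`, `combine_tables`) and the regex are made up for illustration, and it assumes every extracted table is available as a pandas DataFrame of strings (Camelot exposes one as `table.df`).

```python
import re

import pandas as pd


def drop_matching_rows(df, pattern):
    """Drop every row in which at least one cell matches the regex `pattern`."""
    if df.empty:
        return df
    regex = re.compile(pattern, flags=re.IGNORECASE)
    mask = df.apply(
        lambda row: any(regex.search(str(cell)) for cell in row), axis=1
    )
    return df[~mask]


def drop_empty_columns(df):
    """Drop columns that contain only empty strings (e.g. once stray rows are gone)."""
    keep = [c for c in df.columns if df[c].astype(str).str.strip().ne("").any()]
    return df[keep]


def combine_tables(frames, pattern):
    """Clean each page-level frame, renumber its columns, and stack the results."""
    cleaned = []
    for frame in frames:
        out = drop_empty_columns(drop_matching_rows(frame, pattern)).copy()
        out.columns = range(out.shape[1])  # positional column labels so pages align
        cleaned.append(out)
    return pd.concat(cleaned, ignore_index=True)
```

With Camelot this would be called roughly as `combine_tables([t.df for t in tables], r"page \d+ of \d+")`, with the pattern adjusted to whatever text appears in the unwanted first and last rows. It only helps when the stray rows add a column of their own; if they shift the real data into different columns, restricting the extraction area is probably the better route.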
Can you open an issue on https://github.com/py-pdf/pypdf_table_extraction?
Is there any way to search for the text that appears in those rows, or use anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the one containing the data we are interested in)?
Have you tried setting table regions? Or table areas?
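A rough sketch of the table-areas idea, in case it helps. Camelot's `read_pdf` accepts `table_areas` as a list of `"x1,y1,x2,y2"` strings, where `(x1, y1)` is the top-left and `(x2, y2)` the bottom-right corner in PDF coordinates (origin at the bottom-left of the page, y growing upward). The helper names, the coordinates, and the header/footer heights below are hypothetical and would have to be measured on the actual PDF.

```python
def body_area(x1, y_top, x2, y_bottom, skip_top=0, skip_bottom=0):
    """Build a Camelot area string that skips a strip at the top and bottom.

    y grows upward in PDF space, so trimming the top strip lowers y_top
    and trimming the bottom strip raises y_bottom.
    """
    return f"{x1},{y_top - skip_top},{x2},{y_bottom + skip_bottom}"


def extract_body(pdf_path, pages, area, flavor="lattice"):
    """Run Camelot only inside `area`, leaving out the stray first/last rows."""
    import camelot  # imported lazily so body_area() can be used without Camelot

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor=flavor, table_areas=[area]
    )
```

For example, `extract_body("report.pdf", "2-4", body_area(30, 760, 560, 60, skip_top=40, skip_bottom=25))` would look only at the region between the unwanted header and footer rows. This assumes those rows sit at roughly the same position on every page.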
I am using Camelot for table extraction in PDF documents, which generally works well for my needs. However, I've encountered a recurring issue where the first and last rows of tables cause problems during the extraction process, primarily due to their alignment. These rows often differ in format from the rest of the table, affecting the consistency and accuracy of the extracted data. Currently, Camelot does not seem to offer a direct way to exclude specific rows based on their characteristics or alignment.
This feature would be incredibly beneficial for scenarios where table headers or footers consistently deviate in style or alignment from the main table body, leading to extraction inaccuracies. A parameter or method to specify rows to ignore (by index or pattern recognition) during extraction could significantly improve the utility and flexibility of Camelot for users facing similar challenges.
Is there an existing solution or workaround to address this issue, or could this functionality be considered for future updates?
For details, see page 1 and page 2 (the table is long and spans 2, 3, or 4 pages in some PDFs).
You can see that, because of the first and last rows, Camelot creates 11 columns for this data frame when there are actually 10. My PDFs sometimes contain a footer (the last row of the table in the PDF) and a leading header row that I am not interested in extracting; my actual header comes after that leading row.
I have already tried playing with line_tol, joint_tol, split_text, line_scale, shift_text, etc. This works for smaller differences, as in the first screenshot of page 1, but it fails in the case of the second screenshot.
Here is my table-appending function, which builds a single result_df for long tables:
```python
def append_tables_to_dataframe(tables):
    try:
        df_list = []
```
Is there any way to search for the text that appears in those rows, or use anything else, to exclude the first and last rows during extraction so that Camelot focuses on the main table (the one containing the data we are interested in)?
I hope this makes sense; if not, let me know and I would be happy to explain more. If you are able to add this to Camelot, it would make the library more powerful.
Thanks