camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.76k stars 446 forks

add __iter__() for TableList to support enumerate() #486

Open stonyw opened 5 months ago

stonyw commented 5 months ago
    import camelot

    tables = camelot.read_pdf(filename)
    for idx, table in enumerate(tables):   # PyCharm warning: Expected type 'Iterable[_T]', got 'TableList' instead
        pass
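
To illustrate the request, here is a minimal sketch of a list-like container that defines `__iter__()` explicitly. `TableListSketch` is a hypothetical stand-in written for this example, not Camelot's actual `TableList`; the assumption is only that the real class wraps a plain list of tables. With `__len__()` and `__getitem__()` alone, iteration still works at runtime through Python's sequence fallback, but static checkers may not recognise the class as iterable.

```python
from typing import Iterator, List


class TableListSketch:
    """Hypothetical stand-in for a list-like table container (not Camelot's code)."""

    def __init__(self, tables: List[object]) -> None:
        self._tables = tables

    def __len__(self) -> int:
        return len(self._tables)

    def __getitem__(self, idx: int) -> object:
        return self._tables[idx]

    def __iter__(self) -> Iterator[object]:
        # Explicit iterator, so type checkers can see the class is Iterable.
        return iter(self._tables)


tables = TableListSketch(["t0", "t1"])
pairs = list(enumerate(tables))
```

Delegating to `iter(self._tables)` keeps the behaviour identical to indexing through the wrapped list while making the iteration protocol visible to tools such as PyCharm.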
henrywman101 commented 5 months ago

Comparison between table_areas and table_regions (with flavor='stream'): table_areas recognizes tables more accurately.

When using Camelot's camelot.read_pdf function with table_areas and table_regions parameters, you're specifying the exact areas or regions of the page where you expect the tables to be. This is particularly useful for PDFs where tables are not well-detected using the default settings.

table_areas: This parameter expects a list of strings, where each string defines the coordinates of a rectangular area that contains a table. The format of the coordinates is "x1,y1,x2,y2" (in PDF points), where (x1, y1) is the top-left corner of the rectangle and (x2, y2) is the bottom-right corner. The coordinates are in PDF coordinate space, whose origin is the bottom-left corner of the page, so y1 is larger than y2.

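table_regions: This parameter specifies regions of the page where tables are expected, using the same "x1,y1,x2,y2" format. Unlike table_areas, it does not give the exact bounds of a table; it tells Camelot which part of the page to search, so it is useful when a region contains multiple tables.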

Here's an example of how to use these parameters: [image: Screenshot 2024-02-01 at 02.39.30.png] [image: Screenshot 2024-02-01 at 04.13.45.png] [image: Screenshot 2024-02-01 at 02.23.55.png]
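
As a text alternative to the screenshots, the following is a rough sketch of passing either parameter to `camelot.read_pdf` with `flavor='stream'`. The helpers `area_string` and `read_stream_tables`, and the coordinates in the usage note, are made up for illustration; real coordinates have to be read off the PDF in question. Camelot is imported inside the function so the coordinate helper can be used without it.

```python
def area_string(x1, y1, x2, y2):
    """Format a rectangle as "x1,y1,x2,y2" (hypothetical helper).

    (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF
    coordinate space, where the origin is the bottom-left of the page.
    """
    if x1 >= x2 or y1 <= y2:
        raise ValueError("expected x1 < x2 and y1 > y2 (top-left, bottom-right)")
    return f"{x1},{y1},{x2},{y2}"


def read_stream_tables(path, areas=None, regions=None, pages="1"):
    """Call camelot.read_pdf with flavor='stream' and one of the two parameters."""
    import camelot  # imported lazily; requires camelot to be installed

    if areas and regions:
        # Assumption for this sketch: pass only one of the two at a time.
        raise ValueError("pass either areas or regions, not both")
    kwargs = {"flavor": "stream", "pages": pages}
    if areas:
        kwargs["table_areas"] = areas      # exact table bounds
    if regions:
        kwargs["table_regions"] = regions  # where to look for tables
    return camelot.read_pdf(path, **kwargs)
```

For example, `read_stream_tables("13pg.pdf", areas=[area_string(50, 700, 550, 400)])` would restrict parsing to one exact rectangle, while passing the same string as `regions=[...]` would ask Camelot to detect tables within that rectangle.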

$ camelot stream -plot contour 13pg.pdf

MartinThoma commented 4 months ago

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?