camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Two single-row tables in two separate PDFs don't get read by camelot as tables #268

Open myrhillion opened 3 years ago

myrhillion commented 3 years ago

Describe the bug

2 of 55 PDFs with K-12 education table data have one-row tables that don't get processed as tables:

https://www.dpi.nc.gov/media/8350/open
https://www.dpi.nc.gov/media/8325/open

Camelot doesn't find a table in either one, likely because each table has only one row entry.

Steps to reproduce the bug

Ran tables = camelot.read_pdf(weburl, pages='all') in a loop, with weburl set to each of the two URLs above.

Expected behavior

Should have one row table output for these two separate 1 page pdfs.

Code

import camelot

# weburl is set to each of the two PDF URLs in turn
tables = camelot.read_pdf(weburl, pages='all')

PDF

https://www.dpi.nc.gov/media/8350/open
https://www.dpi.nc.gov/media/8325/open

Environment

Windows-10-10.0.19043-SP0 Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)] NumPy 1.21.2 OpenCV 4.5.3 Camelot 0.10.1

Additional context

There are 55 educator prep URLs being cycled through in a loop. These two failed to produce tables, and I bet it's related to having only one row entry after the headers, or something similar.
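For a loop like the one described, a minimal sketch of how one might record which sources come back with no tables. The helper name `find_sources_without_tables` and its `read_pdf` parameter are hypothetical and only for illustration; in real use the reader would be `camelot.read_pdf`, which accepts a URL or file path and returns a list-like object whose length is the number of tables found.

```python
from typing import Callable, Iterable, List


def find_sources_without_tables(
    sources: Iterable[str],
    read_pdf: Callable[..., object],
    **read_kwargs,
) -> List[str]:
    """Return the sources for which read_pdf reported zero tables.

    read_pdf is passed in (hypothetical design choice) so the loop can be
    exercised without network access or a camelot installation.
    """
    empty = []
    for source in sources:
        tables = read_pdf(source, **read_kwargs)
        if len(tables) == 0:
            empty.append(source)
    return empty


# Intended use (assumes camelot is installed and the URLs are reachable):
#   import camelot
#   urls = ["https://www.dpi.nc.gov/media/8350/open",
#           "https://www.dpi.nc.gov/media/8325/open"]
#   print(find_sources_without_tables(urls, camelot.read_pdf, pages="all"))
```

Collecting the failing URLs in one list makes it easier to retry just those documents with different settings.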

tiagosamaha commented 3 years ago

@myrhillion

Did you get an error, or did it just not return the right table?

tiagosamaha commented 3 years ago

If it just didn't return the table, I guess it's because of the table's background color.

Try using the line_scale argument (docs: https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines) to fine-tune line detection.

tables = camelot.read_pdf(weburl, line_scale=45, pages='all')

Another thing to try is process_background (docs: https://camelot-py.readthedocs.io/en/master/user/advanced.html#process-background-lines), because the header line is completely black.
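The two suggestions above can be tried in sequence. The following is a rough sketch, assuming lattice mode (where line_scale and process_background apply). The helper name `first_settings_with_tables`, the injected `read_pdf` parameter, and the candidate list are hypothetical; the value 45 is taken from the comment above and is not a tuned recommendation.

```python
from typing import Callable, Dict, List, Optional

# Keyword-argument combinations to try, in order. An empty dict means
# "camelot defaults". These candidates are illustrative assumptions.
CANDIDATE_SETTINGS: List[Dict[str, object]] = [
    {},
    {"line_scale": 45},
    {"process_background": True},
    {"line_scale": 45, "process_background": True},
]


def first_settings_with_tables(
    source: str,
    read_pdf: Callable[..., object],
    candidates: List[Dict[str, object]] = CANDIDATE_SETTINGS,
    pages: str = "all",
) -> Optional[Dict[str, object]]:
    """Return the first candidate settings that yield at least one table.

    Returns None if no candidate produces a table.
    """
    for settings in candidates:
        tables = read_pdf(source, pages=pages, **settings)
        if len(tables) > 0:
            return settings
    return None


# Intended use (assumes camelot is installed):
#   import camelot
#   print(first_settings_with_tables(weburl, camelot.read_pdf))
```

If this returns None, neither option helped for that document, which matches what a later commenter reports for a different PDF.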

myrhillion commented 3 years ago

Yeah, it wasn't returning a table on those two examples; I thought it may have been due to the one-line tables. It was odd that, out of 55 PDFs with similar table formatting, the only two that failed to return tables were the ones with a single data row.

I'll try the suggestions you provided and see if that works when I can. Thank you.

tiagosamaha commented 2 years ago

@myrhillion did it work?

myrhillion commented 2 years ago

Sorry, I haven't had time to try the fix on this project yet. I had to put it on the back burner for a bit.

Jaszkowic commented 1 year ago

I encountered the same problem. In this PDF, there is a one-row table on page 3 that does not get extracted. All other tables in this PDF are extracted correctly, and all the tables have identical or very similar structure, colors and border lines. Playing around with line_scale and process_background did not change anything.

My configuration (camelot version 0.9.0):

import camelot

tables = camelot.read_pdf(
    "S19-12107.pdf",
    pages="all",
    flavor="lattice",
    suppress_stdout=True
)

Screenshot of table: Bildschirmfoto 2023-08-28 um 17 21 49
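To narrow down a report like this one, it can help to list the pages on which nothing was detected. This is a minimal sketch, assuming each returned table exposes a parsing_report dict with a "page" entry, as camelot's tables do. The helper name `pages_without_tables` and the `page_count` argument are hypothetical; the caller has to supply the page count. Note that a page where some tables were found and one was missed would not be flagged.

```python
from typing import Iterable, List


def pages_without_tables(tables: Iterable[object], page_count: int) -> List[int]:
    """Return the 1-based page numbers on which no table was reported."""
    # Collect every page number that has at least one extracted table.
    pages_with_tables = {int(t.parsing_report["page"]) for t in tables}
    return [p for p in range(1, page_count + 1) if p not in pages_with_tables]


# Intended use (assumes camelot is installed; page_count is supplied by the caller):
#   import camelot
#   tables = camelot.read_pdf("S19-12107.pdf", pages="all", flavor="lattice")
#   print(pages_without_tables(tables, page_count=...))
```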