Open myrhillion opened 3 years ago
@myrhillion
Did you get an error, or did it just not return the right table?
If it just didn't return the table, my guess is that the table's background color is the cause.
Try the line_scale argument (docs) to fine-tune line detection.
tables = camelot.read_pdf(weburl, line_scale=45, pages='all')
Another thing to try is process_background (docs), because the header line is completely black.
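The two suggestions above could also be tried systematically rather than one value at a time. This is only a rough sketch: the helper names and the list of line_scale values are made up for illustration, and read_pdf is passed in as an argument so the loop can be exercised without Camelot installed. line_scale and process_background are the Lattice options mentioned above.

```python
from itertools import product


def candidate_settings(line_scales=(15, 30, 45, 60), backgrounds=(False, True)):
    """Build the keyword-argument combinations to try, in order.

    The values are arbitrary starting points, not recommendations.
    """
    return [
        {"line_scale": scale, "process_background": background}
        for scale, background in product(line_scales, backgrounds)
    ]


def first_working_settings(path, read_pdf, pages="all"):
    """Return (settings, tables) for the first combination that finds a table.

    Returns (None, None) if no combination yields any table.
    """
    for settings in candidate_settings():
        tables = read_pdf(path, pages=pages, flavor="lattice", **settings)
        if len(tables) > 0:
            return settings, tables
    return None, None


# Usage, assuming camelot is installed and weburl points at one of the PDFs:
# import camelot
# settings, tables = first_working_settings(weburl, camelot.read_pdf)
```

If this returns (None, None), neither option is likely to be the cause, which matches what a later comment in this thread reports.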
Yeah it wasn't returning a table on those two examples, I thought it may have been due to one line tables. It was odd that out of 55 pdfs with similar table formatting, the only two that failed to return tables were the one data row tables in those 2 pdfs.
I'll try the suggestions you provided and see if that works when I can. Thank you.
On Fri, Oct 1, 2021 at 7:29 PM Tiago Samaha Cordeiro wrote:
If it just didn't return the table, my guess is that the table's background color is the cause.
Try the line_scale argument (docs https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines) to fine-tune line detection.
tables = camelot.read_pdf(weburl, line_scale=45, pages='all')
Another thing to try is process_background (docs https://camelot-py.readthedocs.io/en/master/user/advanced.html#process-background-lines), because the header line is completely black.
-- Doug Taggart
@myrhillion worked?
Sorry, I haven't had time to try the fix on this project yet. I had to put it on the back burner for a bit.
On Tue, Nov 9, 2021 at 10:53 AM Tiago Samaha Cordeiro wrote:
@myrhillion worked?
I encountered the same problem. In this PDF, there is a one-row table on page 3 that does not get extracted. All other tables in this PDF are extracted correctly. The tables also have identical or similar structure, colors and border lines. Playing around with line_scale and process_background did not change anything.
My configuration (camelot version 0.9.0):
import camelot

tables = camelot.read_pdf(
    "S19-12107.pdf",
    pages="all",
    flavor="lattice",
    suppress_stdout=True,
)
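To confirm that only page 3 is affected, it may help to count the extracted tables per page. A minimal sketch, assuming each extracted table exposes its page number as a page attribute; the function names here are hypothetical and not part of Camelot.

```python
from collections import Counter


def tables_per_page(tables):
    """Count extracted tables by page number.

    Assumes each table object has a `page` attribute.
    """
    return Counter(int(table.page) for table in tables)


def pages_without_tables(tables, page_numbers):
    """Return the pages from page_numbers that produced no table at all."""
    counts = tables_per_page(tables)
    return [page for page in page_numbers if counts[page] == 0]


# Usage, assuming `tables` is the result of a camelot.read_pdf call and the
# page range is known from the PDF itself (a hypothetical 5-page file here):
# print(pages_without_tables(tables, range(1, 6)))
```

Note that a page with several tables where only one is missed would not show up in pages_without_tables; comparing the counts from tables_per_page against the expected number per page covers that case.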
Screenshot of table:
Describe the bug
Two of 55 PDFs with K-12 education table data contain one-row tables that are not detected as tables:
https://www.dpi.nc.gov/media/8350/open
https://www.dpi.nc.gov/media/8325/open
Camelot doesn't find a table in either one, which is likely related to the table having only one data row.
Steps to reproduce the bug
Run tables = camelot.read_pdf(weburl, pages='all') in a loop, with weburl set to each of the two URLs above.
Expected behavior
Each of these two single-page PDFs should produce one table with a single data row.
Code
tables = camelot.read_pdf(weburl, pages='all')
PDF
https://www.dpi.nc.gov/media/8350/open
https://www.dpi.nc.gov/media/8325/open
Screenshots
Environment
Windows-10-10.0.19043-SP0
Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)]
NumPy 1.21.2
OpenCV 4.5.3
Camelot 0.10.1
Additional context
There are 55 educator prep URLs being cycled through in a loop. These two failed to produce tables, and I suspect it's because each has only one data row after the headers.
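For reference, the loop described above can be sketched roughly as follows. The function name is made up for illustration, and read_pdf is passed in so the logic can be checked without Camelot installed; it only records which URLs come back with zero tables.

```python
def find_empty_results(urls, read_pdf):
    """Return the URLs for which read_pdf found no tables."""
    failed = []
    for url in urls:
        tables = read_pdf(url, pages="all")
        if len(tables) == 0:
            failed.append(url)
    return failed


# Usage, assuming camelot is installed and `urls` holds the 55 PDF links:
# import camelot
# print(find_empty_results(urls, camelot.read_pdf))
```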