atlanhq / camelot

Camelot: PDF Table Extraction for Humans

https://camelot-py.readthedocs.io

Tables are not detected #289

Closed. Mehroz01 closed this issue 5 years ago.

Mehroz01 commented 5 years ago

Please see page 5 of the PDF file below. I am trying to extract this type of table, but it is not detected. Could you guide me on how to extract the table from the PDF? I have uploaded both an image of the table and the PDF file (page 5). Thank you.

Attached file: 1-s2.0-S026130691100553X-main.pdf

Attached image: table

anakin87 commented 5 years ago

I succeeded using:

tables = camelot.read_pdf('1-s2.0-S026130691100553X-main.pdf', pages='5', flavor='stream', table_areas=['0,710,250,650'])

Explanation:

flavor='stream': the table has no ruled lines demarcating its cells, so the default lattice flavor cannot find it (see https://camelot-py.readthedocs.io/en/master/user/how-it-works.html#stream)
table_areas=['0,710,250,650']: when using stream, it is advisable to specify the table boundaries (see https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas); to find them, you can do some visual debugging (see https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging)
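For anyone reusing this, here is a minimal sketch of the same call wrapped in two small functions. The helper names area_string and extract_table are made up for illustration and are not part of Camelot; only camelot.read_pdf and its pages, flavor and table_areas arguments come from the library. It assumes camelot-py is installed and that the area string follows the documented "x1,y1,x2,y2" form, with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinates.

```python
def area_string(x1, y1, x2, y2):
    """Format a bounding box as the "x1,y1,x2,y2" string used by table_areas.

    Hypothetical helper, not part of Camelot. (x1, y1) is the top-left
    corner and (x2, y2) the bottom-right corner, in PDF coordinate space
    (origin at the bottom-left of the page).
    """
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def extract_table(pdf_path, page, area):
    """Read tables from one fixed area of one page with the stream flavor.

    Hypothetical wrapper around camelot.read_pdf.
    """
    import camelot  # third-party dependency: pip install camelot-py

    return camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",      # the table has no ruled lines between cells
        table_areas=[area],   # restrict parsing to this bounding box
    )


# Example call (needs camelot-py and the PDF attached to this issue):
# tables = extract_table("1-s2.0-S026130691100553X-main.pdf", 5,
#                        area_string(0, 710, 250, 650))
# print(tables[0].parsing_report)
# print(tables[0].df)
```

The coordinates are the ones from the call above; for another document they would have to be read off a visual-debugging plot first.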

Mehroz01 commented 5 years ago

Thank you @anakin87 for your help :). Is there any way to find the table areas automatically? When we have hundreds of PDFs to extract tables from, we cannot specify every table area manually. Do you know of a way? Thank you.

vinayak-mehta commented 5 years ago

Stream might fail in cases where the table forms a small part of the page, which can be solved using the table_areas kwarg like @anakin87 suggested. You can also try using the table_regions kwarg (https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions) if your table always lies in an approximate location on the page.
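As a rough sketch of how that could look for a whole folder of PDFs: the function names, the default page and the default region below are assumptions for illustration (the region simply reuses the coordinates from earlier in this thread). Only camelot.read_pdf with its table_regions argument is taken from Camelot itself. It assumes every file keeps its table in roughly the same place on the page.

```python
from pathlib import Path


def read_with_region(pdf_path, pages, region):
    """Let Camelot search for tables only inside an approximate region.

    Hypothetical wrapper; region is an "x1,y1,x2,y2" string.
    """
    import camelot  # third-party dependency: pip install camelot-py

    return camelot.read_pdf(
        str(pdf_path),
        pages=pages,
        flavor="stream",
        table_regions=[region],
    )


def extract_from_folder(folder, pages="1", region="0,710,250,650"):
    """Apply the same region to every PDF directly inside a folder.

    Returns a dict mapping each file name to a list of pandas DataFrames.
    A file that cannot be parsed is reported and gets an empty list.
    """
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            tables = read_with_region(pdf_path, pages, region)
            results[pdf_path.name] = [table.df for table in tables]
        except Exception as exc:  # keep going if one file is malformed
            print(f"skipped {pdf_path.name}: {exc}")
            results[pdf_path.name] = []
    return results


# Example call (the folder name is a placeholder):
# frames = extract_from_folder("papers", pages="5")
```

Unlike table_areas, which gives exact table boundaries, table_regions only tells Camelot where to look, so it tolerates small layout differences between files. It still will not help if the table position varies widely from one PDF to the next.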