VikParuchuri / marker

Convert PDF to markdown quickly with high accuracy
https://www.datalab.to
GNU General Public License v3.0
14.15k stars, 720 forks

Enhancement: use tabula to extract table data more precisely #156

Closed by vulcano9 1 month ago

vulcano9 commented 1 month ago

Dear Mr. Paruchuri,

Thank you for this wonderful tool.

Is it possible to integrate tabula into marker to extract tables with higher precision? There is a Python wrapper available: tabula-py. Tabula's column detection is very accurate, while marker's is less so. On the other hand, tabula relies on automatic table recognition, which is often inaccurate (in my tests, the first row is often cut off). As I understand it, surya detects the table borders, so these coordinates would just need to be passed to tabula. The best of both worlds would be to use marker for table recognition and tabula for table conversion. Tabula outputs a DataFrame or a CSV file, which would then still need to be converted to Markdown.
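The idea above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not marker's or surya's actual API: it assumes the detector reports each table as an `(x0, y0, x1, y1)` box in image pixels with the origin at the top-left, rendered at a known DPI. The helper names `bbox_to_tabula_area`, `csv_to_markdown`, and `extract_table_markdown` are hypothetical. The only library call is `tabula.read_pdf` from tabula-py, which takes `area` as `[top, left, bottom, right]` in PDF points and needs a Java runtime.

```python
import csv
import io


def bbox_to_tabula_area(bbox, dpi=72.0):
    """Convert an (x0, y0, x1, y1) pixel box into tabula's
    [top, left, bottom, right] order, scaled to PDF points (72 per inch).

    Assumption: the box has a top-left origin and was measured on a page
    image rendered at `dpi`.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi
    return [y0 * scale, x0 * scale, y1 * scale, x1 * scale]


def csv_to_markdown(csv_text):
    """Render CSV text as a Markdown pipe table; the first row is the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        cells = [c.strip().replace("\n", " ").replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def extract_table_markdown(pdf_path, page, bbox, dpi=72.0):
    """Hypothetical glue: read one table region with tabula, return Markdown."""
    import tabula  # imported lazily: requires tabula-py and Java

    tables = tabula.read_pdf(
        pdf_path,
        pages=page,
        area=bbox_to_tabula_area(bbox, dpi),
        guess=False,  # use the supplied area instead of automatic detection
    )
    if not tables:
        return ""
    return csv_to_markdown(tables[0].to_csv(index=False))
```

Passing `guess=False` together with an explicit `area` is what would bypass tabula's automatic table recognition, which is the part reported as unreliable above. Merged cells are not handled by this sketch.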

What do you think of this idea?

VikParuchuri commented 1 month ago

I am interested in improving table detection, but incorporating tabula won't work because it has a complicated dependency chain (it needs Java) and isn't actively maintained.

vulcano9 commented 1 month ago

Forgive me for asking again, but what do you think of https://github.com/poloclub/unitable to address the limitation: "Tables are not always formatted 100% correctly - text can be in the wrong column."

MjiS commented 1 month ago

> Forgive me for asking again, but what do you think of https://github.com/poloclub/unitable to address the limitation: "Tables are not always formatted 100% correctly - text can be in the wrong column."

This looks good as the table structure decoder keeps the table formatting.

I have used marker. It is good, but it does not work accurately on merged columns/rows in PDFs, or on subsections. I hope the author or other contributors will look into this.