allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Get table content #27

Open Sunnycheey opened 4 years ago

Sunnycheey commented 4 years ago

I want to know why you remove the table content during processing, since table content is structured and important in many situations.

kyleclo commented 3 years ago

Hey @Sunnycheey, we decided that the quality of the tables was too low for practical use, so we did not include them in the release. We've since been working on improving table extraction so that we might include tables in future S2ORC releases. If you're looking for an S2ORC-like dataset that includes higher-quality tables, you can check out https://github.com/allenai/cord19, in which we used IBM Research's table parsing software.