allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Will you consider parsing the section field for each body_text? #14

Closed: tomleung1996 closed this issue 4 years ago

tomleung1996 commented 4 years ago

I notice that most of the section fields are set to null, except for the ones in the abstract.

It would be very helpful if this field could be filled in with actual values in future updates.

kyleclo commented 4 years ago

Hey @tomleung1996, yes, this is a known bug that we're fixing in the next release.

kyleclo commented 4 years ago

@tomleung1996 It's fixed in our latest release, 20200705v1. Here's a screenshot of grep '"section":' on our pdf_parses/: [image]
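
For anyone who wants to check this on their own copy rather than eyeballing grep output, here is a minimal sketch that counts how many body_text paragraphs carry a non-null section. It assumes each line of a pdf_parses file is one JSON object with a body_text list of paragraph objects, each having a section key, as described in this thread. The function name section_coverage and the sample records are made up for illustration and are not part of the S2ORC tooling.

```python
import json
from typing import Iterable, Tuple


def section_coverage(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (paragraphs_with_section, total_paragraphs) over all body_text entries.

    Assumes one JSON object per line, each with an optional "body_text" list
    whose items are dicts that may carry a "section" key.
    """
    filled = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        paper = json.loads(line)
        # "body_text" may be missing or null, so fall back to an empty list
        for paragraph in paper.get("body_text") or []:
            total += 1
            # Treat null and empty-string sections as "not filled in"
            if paragraph.get("section"):
                filled += 1
    return filled, total


# Hypothetical sample records, not real S2ORC data
sample = [
    json.dumps({
        "paper_id": "1",
        "body_text": [
            {"section": "Introduction", "text": "..."},
            {"section": None, "text": "..."},
        ],
    }),
    json.dumps({"paper_id": "2", "body_text": []}),
]

filled, total = section_coverage(sample)
print(f"{filled}/{total} body_text paragraphs have a non-empty section")
```

To run it against a real file, pass an open text handle instead of the sample list, for example one from gzip.open(path, "rt", encoding="utf-8") if your copy is gzip-compressed, where path is wherever you stored the file.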