allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Missing hierarchical information of section heads #37

Closed jacklxc closed 2 years ago

jacklxc commented 2 years ago

The current release greedily assigns each paragraph its immediate (lowest-level) section head. For example, if the "Related Work" section contains any sub-sections or paragraph heads, it becomes hard to extract the entire "Related Work" section by string matching on the section heads.
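
To make the problem concrete, here is a minimal sketch. The paragraph records are made up for illustration: they assume each paragraph is a dict with a `section` and a `text` field, and the sub-section titles are hypothetical. Matching on the head "Related Work" only finds the paragraph sitting directly under it.

```python
# Hypothetical paragraphs mimicking the flat layout described above:
# each paragraph carries only its lowest-level section head.
paragraphs = [
    {"section": "Introduction", "text": "Intro paragraph."},
    {"section": "Related Work", "text": "Overview of prior work."},
    {"section": "Citation Analysis", "text": "Sub-section under Related Work."},
    {"section": "Summarization", "text": "Another sub-section under Related Work."},
    {"section": "Method", "text": "Method paragraph."},
]


def paragraphs_by_head(paragraphs, head):
    """Return texts of paragraphs whose section head equals `head` (case-insensitive)."""
    return [p["text"] for p in paragraphs if p["section"].lower() == head.lower()]


related = paragraphs_by_head(paragraphs, "Related Work")
print(related)  # only the overview paragraph; both sub-section paragraphs are missed
```

Nothing in the flat records says that "Citation Analysis" and "Summarization" belong to "Related Work", so string matching alone cannot recover them.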

lucylw commented 2 years ago

Unfortunately this is the case in the current released version. In the updated s2orc-doc2json utility (which we use to create S2ORC JSON), we now preserve hierarchical section headers when possible (see here).

For future S2ORC releases, this will be standard. If you really need nested section headers now, you could use s2orc-doc2json to reprocess the papers of interest to you. I know that's not the most satisfying answer, but hopefully it provides an interim option.
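
As a rough sketch of what becomes possible once the hierarchy is preserved: the snippet below assumes each reprocessed paragraph carries a `section_path` list of heads, ordered from the top-level section down. That field name and layout are hypothetical, chosen only for illustration; check the actual s2orc-doc2json output for how nested headers are represented.

```python
# Hypothetical reprocessed paragraphs. `section_path` is an assumed field
# (not a confirmed s2orc-doc2json key) listing heads from top level down.
reprocessed = [
    {"section_path": ["Introduction"], "text": "Intro paragraph."},
    {"section_path": ["Related Work"], "text": "Overview of prior work."},
    {"section_path": ["Related Work", "Citation Analysis"], "text": "Sub-section paragraph."},
    {"section_path": ["Related Work", "Summarization"], "text": "Another sub-section paragraph."},
    {"section_path": ["Method"], "text": "Method paragraph."},
]


def paragraphs_under(paragraphs, top_head):
    """Return texts of all paragraphs whose top-level head equals `top_head` (case-insensitive)."""
    wanted = top_head.lower()
    return [
        p["text"]
        for p in paragraphs
        if p["section_path"] and p["section_path"][0].lower() == wanted
    ]


whole_section = paragraphs_under(reprocessed, "Related Work")
print(len(whole_section))  # overview paragraph plus both sub-section paragraphs
```

With the top-level head available on every paragraph, extracting a whole section is a single filter instead of guesswork over sub-section titles.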

jacklxc commented 2 years ago

Thank you, Lucy.