allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/
parsing full dataset? #34

Closed by hp0404 3 years ago

hp0404 commented 3 years ago

Before downloading the latest full release, I'd like to clarify one thing. I noticed you have the s2orc-doc2json library, so do I need to parse the zipped files myself once I have the full dataset, or do you upload processed JSONL files that don't require any parsing?

Thanks

lucylw commented 3 years ago

No, you do not need to do any additional parsing. The S2ORC dataset consists of structured paper data and metadata.

The s2orc-doc2json library is made available so that you can process other documents into the same format as S2ORC if you'd like.

Practicing7 commented 1 year ago

Can you provide some samples of using this dataset?
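
Since the thread describes the release as processed JSONL that needs no extra parsing, reading it could look roughly like the sketch below. This is only an illustration under assumptions: the shard file name (`demo_shard.jsonl.gz`), the gzip compression, and the field names `paper_id`, `title`, and `year` are hypothetical stand-ins and are not confirmed anywhere in this thread, so check the release's own documentation for the actual layout. The helper `write_demo_shard` exists only so the example runs without the real dataset.

```python
import gzip
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line.

    Files ending in .gz are decompressed on the fly; anything else is
    read as plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_demo_shard(path: str) -> None:
    """Write a tiny made-up shard (hypothetical fields, placeholder values)."""
    records = [
        {"paper_id": "1", "title": "First placeholder paper", "year": 2019},
        {"paper_id": "2", "title": "Second placeholder paper", "year": 2020},
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "demo_shard.jsonl.gz")
        write_demo_shard(shard)
        # Stream records one at a time instead of loading the whole file.
        for paper in iter_jsonl(shard):
            print(paper["paper_id"], paper.get("title"))
```

Streaming line by line keeps memory use flat, which matters if the real shards are large. To try it on downloaded data, point `iter_jsonl` at one of the release files and swap in whatever field names that release actually uses.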