allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Data Integrity #38

Closed Hannibal046 closed 2 years ago

Hannibal046 commented 2 years ago

Hi, after downloading the full data from the link you emailed, I found that the set of paper_id values in metadata_0.jsonl.gz does not equal that of pdf_parses_0.jsonl.gz. Am I doing something wrong? Is it possible that a paper whose paper_id is in metadata_0.jsonl.gz appears in some other pdf_parses_x.jsonl.gz file? Thanks so much!

lucylw commented 2 years ago

@Hannibal046 The paper_ids match between each similarly numbered metadata and pdf_parses file. The metadata file will have many more paper_ids, since it also includes papers for which we do not have any full text. Every entry with has_pdf_parse: True in the metadata will have a corresponding entry in the pdf_parses file with the same number.

For example, in metadata_0.jsonl.gz, there are 1366661 entries. Only 310736 of these have a PDF, all of which have corresponding entries in pdf_parses_0.jsonl.gz.
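If you want to verify this on your own copy, a minimal sketch along these lines should work. It assumes each line of both files is a JSON object with a paper_id field, and that has_pdf_parse is stored as a JSON boolean in the metadata entries. The helper names read_paper_ids and compare_shard are made up for illustration and are not part of any S2ORC tooling.

```python
import gzip
import json


def read_paper_ids(path, require_pdf_parse=False):
    """Collect the paper_id values from a gzipped JSON Lines file.

    If require_pdf_parse is True, keep only entries whose has_pdf_parse
    field is truthy (assumed to be a JSON boolean in the metadata files).
    """
    ids = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            if require_pdf_parse and not entry.get("has_pdf_parse"):
                continue
            ids.add(entry["paper_id"])
    return ids


def compare_shard(metadata_path, pdf_parses_path):
    """Compare one metadata shard against the same-numbered pdf_parses shard."""
    all_ids = read_paper_ids(metadata_path)
    expected = read_paper_ids(metadata_path, require_pdf_parse=True)
    parsed = read_paper_ids(pdf_parses_path)
    return {
        "metadata_ids": len(all_ids),
        "with_pdf_parse": len(expected),
        # Flagged in the metadata but absent from the pdf_parses shard.
        "missing_from_pdf_parses": expected - parsed,
        # Present in the pdf_parses shard but not flagged in the metadata.
        "unexpected_in_pdf_parses": parsed - expected,
    }
```

Calling compare_shard("metadata_0.jsonl.gz", "pdf_parses_0.jsonl.gz") on a consistent download should return two empty sets, with with_pdf_parse much smaller than metadata_ids. The sketch holds all IDs of a shard in memory, which is fine for a per-shard check.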

Please let me know if anything is still unclear.

Hannibal046 commented 2 years ago

Thanks so much! That solves all my problems.