allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

number citation #15

zjhuang22 closed 4 years ago

zjhuang22 commented 4 years ago

Hi, thanks for the great work. Is the citation count of a paper provided in the dataset? That is, for a paper A, the number of times A is cited by other papers.

lucylw commented 4 years ago

We provide a notion of inbound/outbound citations to/from papers within the dataset itself (see metadata). These citations are designed to help people identify papers where citation contexts may exist, and are not a very good representation of the "total" number of citations a paper has.

kyleclo commented 4 years ago

@zjhuang22 This takes just a few lines of code to compute for every paper:

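import json

# Map each paper_id to its number of inbound citations within S2ORC.
paper_id_to_num_citations = {}
with open('full/metadata/metadata_0.jsonl') as f_in:
    for line in f_in:
        # Each line is one JSON-encoded metadata record.
        metadata_dict = json.loads(line)
        paper_id = metadata_dict['paper_id']
        paper_id_to_num_citations[paper_id] = len(metadata_dict['inbound_citations'])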

But as Lucy said, the citation count will probably differ from what you see on other websites, like Google or Semantic Scholar. What we give you is the citation count with respect to the subset of papers in our S2ORC collection for which we have citation information, not the true citation count of a paper, which... I'm not sure anyone really knows.

That being said, I just counted a total of 467M+ citation edges in the dataset, so hopefully that's enough coverage for you.
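
The snippet above reads a single metadata file. To cover several files and get a total edge count, a minimal sketch could look like the following. It assumes every metadata file is uncompressed JSONL with the same `paper_id` and `inbound_citations` fields; the helper name `count_inbound_citations` is hypothetical and not part of any S2ORC tooling, and treating a missing or null `inbound_citations` as zero is my own assumption.

```python
import json
from collections import Counter


def count_inbound_citations(metadata_paths):
    """Map paper_id -> number of inbound citations across several JSONL files.

    Hypothetical helper: assumes each line is a JSON object with a
    'paper_id' field and an 'inbound_citations' list.
    """
    counts = Counter()
    for path in metadata_paths:
        with open(path, encoding="utf-8") as f_in:
            for line in f_in:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                # Assumption: a missing or null field means zero citations.
                inbound = record.get("inbound_citations") or []
                counts[record["paper_id"]] += len(inbound)
    return counts
```

Calling it with a list of paths such as `['full/metadata/metadata_0.jsonl']` gives the same per-paper counts as the snippet above, and `sum(counts.values())` gives the number of citation edges in whatever files were passed in. These are still counts within the collection, not total citation counts.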