allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

How to get citing paper from bibliography entries #3

Closed XuhuiZhou closed 4 years ago

XuhuiZhou commented 4 years ago

Hi, I just want to make sure I understand this correctly. According to your paper, the bibliography entries are the way to get the papers cited by a given paper (let's say paper A). However, each bib entry only has a title, not the ID of the cited paper. So if I want to get the content (for example, the abstract) of the papers cited by paper A, is the only option to search the whole dataset by title?

      "bib_entries": {
        "BIBREF9": {
          "ref_id": "b9",
          "title": "Dynamics of exp...",
          "authors": [{"first": "J"...}],
          "year": 2011,
          "venue": "J. Geophys. Res",
          "volume": "116",
          "issn": "B9",
          "pages": "",
          "other_ids": {"doi": ["10.1029/2011jb008521"]}
        }

kyleclo commented 4 years ago

hey @XuhuiZhou, thanks for your interest! If you download the dataset, you'll see that each bibliography entry provides the ID of the paper it is linked to, in a field called "links". For example, here is an entry from the first paper in 0.jsonl.gz:

      "BIBREF28": {
        "ref_id": "b28",
        "title": "Spherical-spline parameterization of three-dimensional Earth 532 models",
        "authors": [
          {
            "first": "Z",
            "middle": [],
            "last": "Wang",
            "suffix": ""
          },
          {
            "first": "F",
            "middle": [
              "A"
            ],
            "last": "Dahlen",
            "suffix": ""
          }
        ],
        "year": 1995,
        "venue": "Geophys. Res. Lett",
        "volume": "22",
        "issn": "22",
        "pages": "3099--3102",
        "other_ids": {
          "doi": [
            "10.1029/95GL03080"
          ]
        },
        "links": "129880072"
      },
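
For readers following along, here is a minimal sketch of how the "links" field could be read off a set of bibliography entries. It assumes only what the example above shows (each entry is a dict that may carry a "links" value); the function name `cited_paper_ids` is hypothetical and not part of any S2ORC tooling.

```python
from typing import Dict, Optional


def cited_paper_ids(bib_entries: Dict[str, dict]) -> Dict[str, Optional[str]]:
    """Map each bibliography key (e.g. "BIBREF28") to its linked paper_id.

    Entries without a "links" value map to None.
    """
    return {key: entry.get("links") for key, entry in bib_entries.items()}


# Trimmed-down copy of the entry shown above (most fields omitted for brevity).
example_entries = {
    "BIBREF28": {"ref_id": "b28", "year": 1995, "links": "129880072"},
}

print(cited_paper_ids(example_entries))
```

The returned ID is then what you would look up as a "paper_id" elsewhere in the corpus.
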
Mayar2009 commented 4 years ago

@kyleclo hi! If you allow me to join the discussion. As I understand it, the question is this: paper A cites paper B, and in paper A's record the bib entry has only a title, not the ID of the cited paper (my guess is that by "id" he means the ID of the cited paper, or the number of the file in the dataset). If he wants to get the abstract of paper B, does he have to search the whole dataset? @XuhuiZhou, right? Your paper says, and I quote, "The bibliography entries are then linked to one of 81.1M candidate papers." How do you represent this link in the JSON file?

OK, my next question: do all the .jsonl files in your dataset have a standard structure? If you could share the skeleton with us, I would be grateful.

XuhuiZhou commented 4 years ago

Yeah, that is definitely part of the question. Searching through the whole dataset could be pretty hard, so I am wondering whether we could know which ID range appears in which file?

Mayar2009 commented 4 years ago

Also, there is one thing I did not understand, if someone can explain. Your repo says "The full corpus consists of 10000 zipped files", while your paper says "We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers, across broad domains of science." As I understand it, every .jsonl file represents one paper, right? If so, you are introducing just 10000 academic publications and not 81.1M. If I am missing something, please help me understand. Thanks!

XuhuiZhou commented 4 years ago

From what I know, one JSONL file includes thousands of papers at least.

Mayar2009 commented 4 years ago

Yes, understood. For example, 1.jsonl has 8100 paper IDs! Hmm, then you can look up the abstract of the paper you need by its paper ID, keeping in mind that not every paper has an abstract.

kyleclo commented 4 years ago

Yeah, there's a ton of papers as part of the full dataset. Each JSONL file consists of thousands of papers, one per line. Each paper has a unique identifier (i.e. "paper_id").
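
As a rough sketch of what "one paper per line" means in practice, and of one way to find out which file holds which "paper_id", something like the following could work. It assumes each line of a gzipped JSONL file is a JSON object with a top-level "paper_id" key; the helper names `iter_papers` and `index_paper_locations` are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def iter_papers(path: str) -> Iterator[dict]:
    """Yield one parsed paper record per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def index_paper_locations(paths: Iterable[str]) -> Dict[str, str]:
    """Map every paper_id to the path of the file that contains it.

    Assumption: each record has a top-level "paper_id" key.
    """
    locations: Dict[str, str] = {}
    for path in paths:
        for paper in iter_papers(path):
            locations[paper["paper_id"]] = path
    return locations
```

Streaming line by line avoids loading a whole file into memory. For the full corpus an in-memory dict like this may get large, so writing the mapping to disk (for example into a small key-value store) is probably the safer choice.
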

As I understood the original question in this thread ("However, the entry of bib entries only has a title but not the id of the papers"), I'm pretty sure this shouldn't be true, unless there's a bug in the file. As seen in the example I posted above, each bibliography entry has a field called "links" which provides the "paper_id" of the associated paper. If the bibliography entry is not linked, because (i) the cited paper isn't in the corpus or (ii) our linking failed to match it, then that field will be null.
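
A small sketch of how the null case might be handled when walking a paper's bibliography: entries with a "links" value can be looked up by "paper_id", while the rest would need some fallback such as the title or a DOI from "other_ids". The function name `split_by_link` is hypothetical, and treating a missing "links" key the same as null is an assumption made here for convenience.

```python
from typing import Dict, List, Tuple


def split_by_link(bib_entries: Dict[str, dict]) -> Tuple[Dict[str, str], List[str]]:
    """Separate bibliography entries into linked and unlinked ones.

    Returns (linked, unlinked): `linked` maps a bib key to the cited paper_id,
    `unlinked` lists bib keys whose "links" value is null or absent.
    """
    linked: Dict[str, str] = {}
    unlinked: List[str] = []
    for key, entry in bib_entries.items():
        link = entry.get("links")
        if link is None:
            unlinked.append(key)
        else:
            linked[key] = link
    return linked, unlinked
```
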

If you aren't seeing a "links" field in bibliography entries, that's a bug. Can you point me to which JSONL file and which "paper_id" contains this?

kyleclo commented 4 years ago

@XuhuiZhou, I was wondering if this is resolved?

XuhuiZhou commented 4 years ago

Yes! It is resolved, thanks for the response.