Closed XuhuiZhou closed 4 years ago
hey @XuhuiZhou, thanks for your interest! If you download the dataset, you'll see that the ID of the paper linked to each bibliography entry is provided in a field called "links". For example, here is an entry from the first paper in 0.jsonl.gz:
"BIBREF28": {
"ref_id": "b28",
"title": "Spherical-spline parameterization of three-dimensional Earth 532 models",
"authors": [
{
"first": "Z",
"middle": [],
"last": "Wang",
"suffix": ""
},
{
"first": "F",
"middle": [
"A"
],
"last": "Dahlen",
"suffix": ""
}
],
"year": 1995,
"venue": "Geophys. Res. Lett",
"volume": "22",
"issn": "22",
"pages": "3099--3102",
"other_ids": {
"doi": [
"10.1029/95GL03080"
]
},
"links": "129880072"
},
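A minimal sketch of reading that field in Python, assuming a paper's bibliography has already been loaded as a dict keyed by names like "BIBREF28". The function name `linked_paper_ids` is hypothetical, and treating a missing or empty "links" value as "not linked" is an assumption, not something the dataset documentation is quoted as saying here:

```python
def linked_paper_ids(bib_entries):
    """Map each bibliography key to the ID in its "links" field, skipping unlinked entries."""
    linked = {}
    for key, entry in bib_entries.items():
        link = entry.get("links")
        if link:  # assumption: missing, None, or empty means no linked paper
            linked[key] = link
    return linked


example = {
    "BIBREF28": {"ref_id": "b28", "year": 1995, "links": "129880072"},
    # made-up entry, only here to illustrate an unlinked reference
    "BIBREF29": {"ref_id": "b29", "year": 2001, "links": None},
}
print(linked_paper_ids(example))
```

The returned IDs can then be matched against the "paper_id" of other records to fetch the cited papers.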
@kyleclo hi! If I may join the discussion: as I understand it, the question is this. Paper A cites paper B, and in paper A we only have a bibliography entry like the one shown above. (The original post says the bib entries have only a title and not the ID of the papers; my guess is that by "id" he means the ID of the cited papers, i.e. their number in the dataset files.) If he wants to get the abstract of paper B, does he have to search the whole dataset? @XuhuiZhou, right? Your paper says, and I quote, "The bibliography entries are then linked to one of 81.1M candidate papers." How do you represent this link in the JSON file?
OK, my next question: do all .jsonl files in your dataset have a standard structure? If you could share the skeleton with us, I would be grateful.
Yeah, that should definitely be part of the question. Searching through the whole dataset could be pretty hard. Is there a way to know which ID range appears in which file?
Also, there is one thing I did not understand; maybe someone can explain. Your repo says "The full corpus consists of 10000 zipped files", while your paper says "We introduce the Semantic Scholar Graph of References in Context (GORC), a large contextual citation graph of 81.1M academic publications, including parsed full text for 8.1M open access papers, across broad domains of science." As I understand it, every .jsonl file represents one paper, right? If so, you would be introducing just 10000 academic publications, not 81.1M. If I am missing something, please help me understand. Thanks!
From what I know, one JSONL file includes at least thousands of papers.
Yes, understood. For example, 1.jsonl has 8100 paper IDs! Hmm, then you can look up the abstract of the paper you need by its paper ID, keeping in mind that not every paper has an abstract.
Yeah, there's a ton of papers as part of the full dataset. Each JSONL file consists of thousands of papers, one per line. Each paper has a unique identifier (i.e. "paper_id").
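Since each file holds one paper per line, looking up cited papers by ID could be sketched roughly as below. This is only an illustration, not an official loader: the helper names `iter_papers` and `find_papers` are hypothetical, and it assumes "paper_id" values have the same type as the IDs you pass in (strings, like the "links" value in the example above).

```python
import gzip
import json


def iter_papers(path):
    """Yield one paper dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_papers(paths, wanted_ids):
    """Scan the given files and return {paper_id: paper} for the wanted IDs."""
    remaining = set(wanted_ids)
    found = {}
    for path in paths:
        for paper in iter_papers(path):
            pid = paper.get("paper_id")
            if pid in remaining:
                found[pid] = paper
                remaining.discard(pid)
        if not remaining:
            break  # stop early once every ID has been located
    return found
```

For example, `find_papers(["0.jsonl.gz"], {"129880072"})` would scan only that one file. IDs that do not appear in the scanned files are simply absent from the result, so a full lookup may still require passing every file in the corpus.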
As I understood the original question in this thread ("However, the entry of bib entries only has a title but not the id of the papers"), I'm pretty sure this shouldn't be true unless there's a bug in the file. As seen in the example I posted above, each bibliography entry has a "links" field.
If you aren't seeing a "links" field in bibliography entries, that's a bug. Can you point me to which JSONL file and which "paper_id" contains this?
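To collect that information for a report, something along these lines might help. It is a rough sketch under stated assumptions: the function name `entries_missing_links` is hypothetical, and the key under which a paper stores its bibliography (guessed here as "bib_entries") is not confirmed anywhere in this thread, so adjust `bib_key` to whatever the files actually use.

```python
import gzip
import json


def entries_missing_links(path, bib_key="bib_entries"):
    """Return (paper_id, bib_ref) pairs whose bibliography entry has no "links" key."""
    missing = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            paper = json.loads(line)
            # bib_key is an assumed location for the bibliography dict
            for ref, entry in (paper.get(bib_key) or {}).items():
                if "links" not in entry:
                    missing.append((paper.get("paper_id"), ref))
    return missing
```

An empty result for a file suggests every bibliography entry in it carries the field; any pairs returned identify the "paper_id" and entry to mention.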
@XuhuiZhou, I was wondering if this is resolved?
Yes! It is resolved, thanks for the response.
Hi, I just want to make sure I understand things right. According to your paper, it seems that bibliography entries serve as a way to get the papers cited by a paper (let's say paper A). However, the entry of bib entries only has a title but not the id of the papers. So if I want to get the content (for example, the abstract) of the papers cited by paper A, is the only way to use the title to search the whole dataset?