allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Where to look for the citations? #11

Closed romankht84 closed 4 years ago

romankht84 commented 4 years ago

Thumbs up for doing such a great job. Could you please give a short explanation of the citation context? It does not appear in almost 90% of the papers (only 8.1M have full text and context information). Could you show, with an example, the context in which paper 1 cites paper 2?

tomleung1996 commented 4 years ago

Hi romankht84, I am also interested in this problem. However, I found that we can only obtain the citation context manually, as this dataset only provides cite_spans. Did you find any other solutions?

romankht84 commented 4 years ago

Hi tomleung,

I have found one other dataset but that is very small: https://ordo.open.ac.uk/articles/Citation_Knowledge_based_Dataset/10673132

They have also provided a very large dataset named C2D, but it is not annotated with the citation context. I will let you know if I make any progress. Please keep in touch.

tomleung1996 commented 4 years ago

Thank you, @romankht84.

Regarding citation context corpora, I do know a few. Besides C2D and the small dataset you mentioned, unarXive is a recent one that is worth attention: https://zenodo.org/record/3385851#.Xl5q7oG-uhA

I am a PhD student in Information Science. Would you please share your email or other contact details so that we can communicate? You can find my email on my GitHub profile.

romankht84 commented 4 years ago

Thank you, @tomleung1996. I have sent you an email from my account.

kyleclo commented 4 years ago

Sorry about missing this. Yes, we provide citation mentions for full text papers. For papers without full text, there's no way to provide citations. We expect users to perform their own sentence splitting.
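As a rough illustration of the sentence splitting mentioned above, here is a minimal sketch that returns the sentence containing a citation mention, given its character offsets. The regex boundary rule and the function names `sentence_spans` and `citing_sentence` are assumptions made for this example, not part of S2ORC. The rule is deliberately naive (it will split incorrectly after abbreviations such as "et al." when a capitalized word follows), so a proper sentence tokenizer is likely a better choice for real use.

```python
import re

# Naive boundary (an assumption for this sketch): whitespace that follows
# '.', '!' or '?' and is followed by an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_spans(text):
    """Return (start, end) character offsets of each naive sentence in text."""
    spans, start = [], 0
    for match in _BOUNDARY.finditer(text):
        spans.append((start, match.start()))
        start = match.end()
    spans.append((start, len(text)))
    return spans


def citing_sentence(text, cite_start, cite_end):
    """Return the sentence containing the mention text[cite_start:cite_end]."""
    for start, end in sentence_spans(text):
        if start <= cite_start < end:
            return text[start:end]
    return text  # fallback: offset fell between sentences, return the paragraph


# Hypothetical paragraph, not taken from the corpus.
paragraph = ("We study volcanoes. Prior work [Smith, 2010] found tilt signals. "
             "We extend it.")
mention = "[Smith, 2010]"
cite_start = paragraph.index(mention)
cite_end = cite_start + len(mention)

sentence = citing_sentence(paragraph, cite_start, cite_end)
print(sentence)
```

The offsets passed in play the same role as the start and end values of a cite_spans entry, so the same function could be applied to each span of a full-text paragraph.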

romankht84 commented 4 years ago

"we provide citation mentions for full text papers": the same is mentioned in your paper as well. Unfortunately, the files I open are missing them. Could you please point me to an example file in which I can see them?

"there's no way to provide citations": I agree with that, no doubt.

kyleclo commented 4 years ago

Here's an example using the first shard, 0.jsonl. If you run

head -1 0.jsonl | jq '.grobid_parse' | head -30

the output is:

{
  "abstract": [],
  "body_text": [
    {
      "text": "solution to this is to evaluate long wavelength, very-long-period (VLP) data that are relativelyFuego is a 3800 m stratovolcano that regularly produces Strombolian and weak 76Vulcanian explosions. The dynamics of these explosive events have been examined in the VLP 77 band [Lyons and Waite, 2011] and modeled together with infrasound and gas emission data. At 78 least three different styles of VLP event have been observed and attributed to eruptions from 79 either the summit vent or a flank vent [Waite et al., 2013] . The strongest recorded explosions 80 generated impulsive infrasound and seismic signals, ejected incandescent bombs and tephra, and 81were associated with repetitive VLP seismicity. The previous studies of Fuego VLP events 82 focused on periods from 30-10 seconds, where the influence of ground tilt is negligible given the 83 distances to the source and relatively short VLP wavelengths. Although the station geometry was 84 somewhat limited by the logistical and safety considerations, Lyons and Waite [2011] found that 85 the data best fit a source with a centroid 300 m below and 300 m to the west of the summit with a 86 moment tensor representative of primarily a dipping crack. The 30-10 second VLP captures the 87 inflation-deflation-reinflation cycle of this portion of the conduit as a small eruption occurs. 88 Lyons et al. [2012] examined the tilt signal associated with these same small explosions 89 at periods below the instrument corners. They found a significant tilt signal beginning up to 30 90 minutes prior to explosive eruptions. Forward modeling of the tilt from stations that were close 91 enough to record it suggested a shallow source midway between the VLP source centroid and the 92 summit. A full waveform inversion was not attempted. 
93In this study, we perform full waveform inversions of stacks of events associated with 94 summit vent explosions in periods from 400 -10 seconds using a combined rotation-translation 95 approach similar to that of Maeda et al. [2011] . Inversions were performed in different bands to 96 explore the increasing influence of tilt with increasing period. While events with periods of 100s 97 of seconds are sometimes called Ultra-Long-Period events [e.g., Johnson et al., 2009], we simply 98 use the term VLP to cover the range of periods we investigate here. To improve the signal to 99 noise ratio, and ensure a representative dataset, inversions were performed on a set of phase-100 weighted, stacked seismograms from six explosions. The cleaner signals that resulted from 101 stacking also allowed for a larger number of seismic channels to be used than in previous studies. 102In order to constrain the uncertainty on the source type, we performed a nonlinear inversion for 103 moment tensor source type. This involves a grid search over all possible moment tensor types 104 and orientations at the best-fit centroid location, providing quantitative constraint on the source 105 type.",
      "cite_spans": [
        {
          "start": 274,
          "end": 297,
          "text": "[Lyons and Waite, 2011]",
          "latex": null,
          "ref_id": "BIBREF9"
        },
        {
          "start": 500,
          "end": 520,
          "text": "[Waite et al., 2013]",
          "latex": null,
          "ref_id": "BIBREF27"
        },
        {
          "start": 1027,
          "end": 1033,
          "text": "[2011]",
          "latex": null,
          "ref_id": null
        },

The cite_spans field indexes into the text field. The offsets are zero-based and the end offset is exclusive, so in this example text[274:297] is the citation mention [Lyons and Waite, 2011], which is linked to the bibliography entry BIBREF9.
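To make the indexing concrete, here is a minimal Python sketch that slices each mention out of its paragraph and attaches a fixed-width character window as a crude context. The function name `iter_citation_mentions`, the `window` parameter, and the miniature record are hypothetical. The sketch also assumes that a record without full text has a missing or null grobid_parse, and that one line of a shard parses as a single JSON object.

```python
import json


def iter_citation_mentions(record, window=40):
    """Yield (mention, ref_id, context) for every cite_span in a parsed record."""
    # Assumption: grobid_parse is absent or null when there is no full text.
    parse = record.get("grobid_parse") or {}
    for paragraph in parse.get("body_text", []):
        text = paragraph["text"]
        for span in paragraph.get("cite_spans", []):
            start, end = span["start"], span["end"]
            mention = text[start:end]  # zero-based, end-exclusive slice
            context = text[max(0, start - window):end + window]
            yield mention, span.get("ref_id"), context


# Hypothetical miniature record with the same field names as the output above.
sample_text = ("The dynamics have been examined in the VLP band "
               "[Lyons and Waite, 2011] and modeled with infrasound data.")
mention_text = "[Lyons and Waite, 2011]"
start = sample_text.index(mention_text)
sample_line = json.dumps({
    "grobid_parse": {
        "abstract": [],
        "body_text": [{
            "text": sample_text,
            "cite_spans": [{
                "start": start,
                "end": start + len(mention_text),
                "text": mention_text,
                "latex": None,
                "ref_id": "BIBREF9",
            }],
        }],
    }
})

record = json.loads(sample_line)  # stands in for one line of a shard such as 0.jsonl
results = list(iter_citation_mentions(record))
for mention, ref_id, context in results:
    print(ref_id, mention, "|", context)
```

A ref_id of None (null in the JSON) is passed through unchanged, so callers can decide how to treat mentions that carry no bibliography link.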

romankht84 commented 4 years ago

I guess I did not explain my question well. I have seen the details you provided above in lots of examples. What I am interested in is the citation reason/motivation/function, that is, the purpose of citing a paper. This is available in the SciCite dataset under the key 'label'.

kyleclo commented 4 years ago

Ah yes, citation intent would require additional semantic annotation on top of the raw citation contexts that we provide.

If you're looking for annotations regarding citation intent, I recommend you take a look at Cohan et al. (https://arxiv.org/abs/1904.01608) and Jurgens et al. (https://transacl.org/ojs/index.php/tacl/article/view/1266).