allenai / scicite

Repository for NAACL 2019 paper on Citation Intent prediction
Apache License 2.0

SciCite


This repository contains datasets and code for classifying citation intents in academic papers.
For details on the model and data refer to our NAACL 2019 paper: "Structural Scaffolds for Citation Intent Classification in Scientific Publications".

Data

We introduce SciCite, a new large dataset of citation intents. Download it from the following link:

scicite.tar.gz (22.1 MB)

The data is in the JSON Lines format (each line is a JSON object).
The citation intent label for each object is specified with the label key, and the citation context is specified with the context key. Example entry:

{
  "string": "In chacma baboons, male-infant relationships can be linked to both
    formation of friendships and paternity success [30,31].",
  "sectionName": "Introduction",
  "label": "background",
  "citingPaperId": "7a6b2d4b405439",
  "citedPaperId": "9d1abadc55b5e0",
  ...
}
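Because the data is one JSON object per line, it can be loaded with a few lines of Python. The sketch below (file name is a placeholder for a split extracted from the archive) reads the entries and tallies the intent labels:

```python
import json
from collections import Counter

def load_scicite(path):
    """Read a SciCite JSON Lines file: one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def label_counts(entries):
    """Tally the citation intent labels (the 'label' key)."""
    return Counter(e["label"] for e in entries)

# Example usage (file name is a placeholder):
# entries = load_scicite("train.jsonl")
# print(label_counts(entries).most_common())
```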

You may obtain the full information about a paper from the provided paper ids using the Semantic Scholar API.
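For example, a paper id can be turned into a Semantic Scholar lookup URL. This sketch assumes the public Graph API endpoint and field names; check the API documentation before relying on them:

```python
import json
from urllib.request import urlopen

# Assumed public endpoint of the Semantic Scholar Graph API.
API = "https://api.semanticscholar.org/graph/v1/paper/"

def paper_url(paper_id, fields=("title", "year", "abstract")):
    """Build the metadata lookup URL for a citingPaperId/citedPaperId."""
    return API + paper_id + "?fields=" + ",".join(fields)

# Example (requires network access; the paper id is hypothetical):
# with urlopen(paper_url("7a6b2d4b405439")) as resp:
#     print(json.load(resp)["title"])
```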

We also run experiments on a pre-existing dataset of citation intents in the computational linguistics domain (ACL-ARC) introduced by Jurgens et al. (2018). The preprocessed dataset is available at ACL-ARC data.

Setup

The project requires Python 3.6 and is based on the AllenNLP library.

Setup an environment manually

Use pip to install the dependencies in your desired Python environment:

pip install -r requirements.in -c constraints.txt

Running a pre-trained model on your own data

Download one of the pre-trained models and run the following command:

allennlp predict [path-to-model.tar.gz] [path-to-data.jsonl] \
--predictor [predictor-type] \
--include-package scicite \
--overrides "{'model':{'data_format':''}}"

If you are using your own data, you first need to convert it to the SciCite data format described above.
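A minimal sketch of such a conversion, assuming your records carry the citation text plus optional section name and paper ids (the input key names are assumptions about your own data; the output keys mirror the example entry above):

```python
import json

def to_scicite_jsonl(records, out_path):
    """Write records as SciCite-style JSON Lines for prediction.

    Each input record is a dict with at least a 'text' key (the citation
    context); 'section', 'citing_id', and 'cited_id' are optional.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for i, rec in enumerate(records):
            entry = {
                "string": rec["text"],
                "sectionName": rec.get("section", ""),
                "citingPaperId": rec.get("citing_id", f"unknown-{i}"),
                "citedPaperId": rec.get("cited_id", f"unknown-{i}"),
            }
            f.write(json.dumps(entry) + "\n")
```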

Pretrained models

We also release our pretrained models; download from the following path:

Training your own models

First, you need a config file for your training configuration. Check the experiment_configs/ directory for example configurations. Important options (you can specify them with environment variables) are:

  "train_data_path":  # path to training data,
  "validation_data_path":  # path to development data,
  "test_data_path":  # path to test data,
  "train_data_path_aux":  # path to the data for the section title scaffold,
  "train_data_path_aux2":  # path to the data for the citation worthiness scaffold,
  "mixing_ratio":  # parameter \lambda_2 in the paper (sensitivity of loss to the first scaffold)
  "mixing_ratio2":  # parameter \lambda_3 in the paper (sensitivity of loss to the second scaffold)

After downloading the data, edit the configuration file with the correct paths. You also need to set an environment variable specifying whether to use ELMo contextualized embeddings:

export elmo=true

Note that training with ELMo is significantly slower.

After making sure you have the correct configuration file, start training the model.

python scripts/train_local.py train_multitask_2 [path-to-config-file.json] \
-s [path-to-serialization-dir/] \
--include-package scicite

The model output and logs will be stored in [path-to-serialization-dir/].

Citing

If you found our dataset or code useful, please cite Structural Scaffolds for Citation Intent Classification in Scientific Publications.

@InProceedings{Cohan2019Structural,
  author = {Arman Cohan and Waleed Ammar and Madeleine Van Zuylen and Field Cady},
  title = {Structural Scaffolds for Citation Intent Classification in Scientific Publications},
  booktitle = {NAACL},
  year = {2019}
}