allenai / scidocs

Dataset accompanying the SPECTER model
Other
127 stars 18 forks source link

SciDocs - The Dataset Evaluation Suite for SPECTER

SPECTER Public API | SPECTER Code Base | Paper

This repository contains code, link to data, and instructions to use the SciDocs evaluation suite.

Use SciRepEval Instead

As of December 2022, the Semantic Scholar research team recommends that you use SciRepEval instead of SciDocs for your embedding testing needs. It includes most of the SciDocs tasks as a subset, but has many more diverse tasks (classification, regression, search, nearest neighbors), a number of which are much larger and designed for training instead of evaluation.

Installation

To install this package, run the following:

git clone https://github.com/allenai/scidocs.git
cd scidocs
conda create -y --name scidocs python==3.7
conda activate scidocs
conda install -y -q -c conda-forge numpy pandas scikit-learn=0.22.2 jsonlines tqdm sklearn-contrib-lightning pytorch
pip install pytrec_eval awscli allennlp==0.9 overrides==3.1.0
python setup.py install

To obtain the data, run this command after the package is installed (from inside the scidocs folder):
[Expected download size is: 4.6 GiB]

aws s3 sync --no-sign-request s3://ai2-s2-research-public/specter/scidocs/ data/

Note: if you're having issues, make sure you're in the us-west-2 region.

Windows Support

Because pytrec_eval does not support Windows, you won't be able to install SciDocs on Windows. The data, however, is still accessible via the awscli command above.

How to run SciDocs

To obtain SciDocs metrics, you must first embed each entry in the 3 metadata files:

The embeddings must then reside in jsonl files with one json entry embedding per line, which will look something like this: {"paper_id": "0dfb47e206c762d2f4caeb99fd9019ade78c2c98", "embedding": [-3, -6, 0, ..., 2]}

We include the SPECTER embeddings as well. Here is how to reproduce the results in the SPECTER paper:

Once you have these 3 embedding files you can get all of the relevant metrics as follows:

from scidocs import get_scidocs_metrics
from scidocs.paths import DataPaths

# point to the data, which should be in scidocs/data by default
data_paths = DataPaths()

# point to the included embeddings jsonl
classification_embeddings_path = 'data/specter-embeddings/cls.jsonl'
user_activity_and_citations_embeddings_path = 'data/specter-embeddings/user-citation.jsonl'
recomm_embeddings_path = 'data/specter-embeddings/recomm.jsonl'

# now run the evaluation
scidocs_metrics = get_scidocs_metrics(
    data_paths,
    classification_embeddings_path,
    user_activity_and_citations_embeddings_path,
    recomm_embeddings_path,
    val_or_test='test',  # set to 'val' if tuning hyperparams
    n_jobs=12,  # the classification tasks can be parallelized
    cuda_device=-1  # the recomm task can use a GPU if this is set to 0, 1, etc
)

print(scidocs_metrics)

And you should see the following output:

{'mag': {'f1': 81.95}, 'mesh': {'f1': 86.44}, 'co-view': {'map': 83.63, 'ndcg': 91.5}, 'co-read': {'map': 84.46, 'ndcg': 92.39}, 'cite': {'map': 88.3, 'ndcg': 94.88}, 'co-cite': {'map': 88.11, 'ndcg': 94.77}, 'recomm': {'adj-NDCG': 53.9, 'adj-P@1': 20.0}}

Which matches exactly the last row of Table 1 in the SPECTER paper. Your results should be identical, with the exception of recomm due to a lack of reproducibility guarantees from PyTorch: https://pytorch.org/docs/stable/notes/randomness.html.

To run your own models, you need to generate your own embedding jsonl files. To tune hyperparameters, you can set the val_or_test='val' in the get_scidocs_metrics function and use the resulting values as part of your objective function.

Wrapper

To use SciDocs from command line you can use the provided wrapper:

python scripts/run.py \
--cls data/specter-embeddings/cls.jsonl \
--user-citation data/specter-embeddings/user-citation.jsonl \
--recomm data/specter-embeddings/recomm.jsonl \
--val_or_test test \
--n-jobs 12 \
--cuda-device -1

MAG and MeSH Class Labels

The MAG and MeSH datasets included with this repo have integer labels starting at 0. Here is how these integers map to class names.

For MeSH:

0    Cardiovascular diseases
1    Chronic kidney disease
2    Chronic respiratory diseases
3    Diabetes mellitus
4    Digestive diseases
5    HIV/AIDS
6    Hepatitis A/B/C/E
7    Mental disorders
8    Musculoskeletal disorders
9    Neoplasms (cancer)
10   Neurological disorders

And for MAG:

0   Art
1   Biology
2   Business
3   Chemistry
4   Computer science
5   Economics
6   Engineering
7   Environmental science
8   Geography
9   Geology
10  History
11  Materials science
12  Mathematics
13  Medicine
14  Philosophy
15  Physics
16  Political science
17  Psychology
18  Sociology

Errata

The SPECTER paper erroneously lists the number of training examples in the recommendations task as 20K. The correct value is 4K.

Citation

Please cite the SPECTER paper as:

@inproceedings{specter2020cohan,
  title={SPECTER: Document-level Representation Learning using Citation-informed Transformers},
  author={Arman Cohan and Sergey Feldman and Iz Beltagy and Doug Downey and Daniel S. Weld},
  booktitle={ACL},
  year={2020}
}