S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

Logo of S2ORC, pronounced stork

S2ORC: The Semantic Scholar Open Research Corpus

S2ORC is a general-purpose corpus for NLP and text mining research over scientific papers.

News and Releases

S2ORC now available through S2 API

It's Jan 2023; happy new year! After years of being managed as a research project, S2ORC has been adopted as a core dataset offering of the Semantic Scholar Public API, where it is available as a "Bulk Dataset". Please look for the download instructions under "Bulk Dataset". The dataset is continuously rebuilt, so accessing it through the API also gives you new papers as they are added.

Software Release: 2021-02-01

S2ORC Release: 2020-07-05

Project Status: 2020-04-07

S2ORC Release: 2019-09-28

Download instructions

The original S2ORC dataset files were refactored into multiple datasets available through the Semantic Scholar APIs (See detailed documentation here).
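
As a rough sketch of how one might check which datasets a release contains, the snippet below reads the latest release metadata from the same `release/latest` endpoint used in the download script and extracts the dataset names. The `datasets` field and its `name` keys are assumptions about the response shape, and `fetch_latest_release` and `dataset_names` are hypothetical helper names, not part of any official client.

```python
import json
from typing import List
from urllib.request import urlopen

LATEST_RELEASE_URL = "https://api.semanticscholar.org/datasets/v1/release/latest"


def fetch_latest_release(url: str = LATEST_RELEASE_URL) -> dict:
    """Fetch metadata for the latest release (performs a network request)."""
    with urlopen(url) as resp:
        return json.load(resp)


def dataset_names(release: dict) -> List[str]:
    """Return the sorted dataset names found in release metadata.

    Assumes (not confirmed above) that the metadata carries a `datasets`
    list whose entries are objects with a `name` key.
    """
    return sorted(d["name"] for d in release.get("datasets", []) if "name" in d)
```

Calling `dataset_names(fetch_latest_release())` should then show whether `s2orc` is among the names in the latest release, provided the assumed response shape holds.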

Once you obtain an API key from the Semantic Scholar Public API, you should be able to access these bulk dumps like so:

import json
import os
import re
import requests
import wget
from tqdm import tqdm

# modify these
API_KEY = "..."
DATASET_NAME = "s2orc"
LOCAL_PATH = "/my/local/path/for/s2orc/"
os.makedirs(LOCAL_PATH, exist_ok=True)

# get latest release's ID
response = requests.get("https://api.semanticscholar.org/datasets/v1/release/latest").json()
RELEASE_ID = response["release_id"]
print(f"Latest release ID: {RELEASE_ID}")

# get the download links for the s2orc dataset; needs to pass API key through `x-api-key` header
# download via wget. this can take a while...
response = requests.get(f"https://api.semanticscholar.org/datasets/v1/release/{RELEASE_ID}/dataset/{DATASET_NAME}/", headers={"x-api-key": API_KEY}).json()
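for url in tqdm(response["files"]):
    # dots are escaped so they match literally; the last group absorbs anything after `.gz` (e.g. a query string)
    match = re.match(r"https://ai2-s2ag\.s3\.amazonaws\.com/staging/(.*)/s2orc/(.*)\.gz(.*)", url)
    assert match is not None, f"Unexpected URL format: {url}"
    assert match.group(1) == RELEASE_ID
    SHARD_ID = match.group(2)
    wget.download(url, out=os.path.join(LOCAL_PATH, f"{SHARD_ID}.gz"))
print("Downloaded all shards.")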
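
Once the shards are on disk, they can be streamed without decompressing them to separate files first. The sketch below assumes each shard is a gzip-compressed text file holding one JSON object per line; that layout is an assumption rather than something stated above, and `iter_records` is a hypothetical helper name.

```python
import gzip
import json
from typing import Iterator


def iter_records(shard_path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped shard.

    Assumes a JSON Lines layout (one object per line); adjust if the
    downloaded files turn out to be structured differently.
    """
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `next(iter_records(os.path.join(LOCAL_PATH, f"{SHARD_ID}.gz")))` would return the first record of a downloaded shard, whose keys can then be inspected before processing the rest.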

For questions, feature requests, and bug reports, please search existing issues on the s2-folks GitHub repo before creating a new issue.

Contact us

The best way to contact us is through email. Don't hesitate to reach out about anything; we've helped a lot of people get started with the dataset, which can be a bit daunting given its size.

Email: Please include {kylel, lucyw, rodneyk} on all correspondence.

Twitter: @kylelostat, @lucyluwang

Give us Feedback: Totally optional, but we'd love to hear how you're using this dataset and any feedback for improving it. Send us an email or open a GitHub issue.

Report issues:

S2ORC is now maintained by the S2 API product team. For questions, feature requests, and bug reports, please search existing issues on the s2-folks GitHub repo before creating a new issue.

FAQ

What's the difference between S2ORC and S2AG?

At a high level:

If you're unsure what to use or cite, please email us and we'd be happy to discuss your project with you.

I have an old version of S2ORC. How is it different from the version of S2ORC from the S2 API?

License

S2ORC is currently released through the Semantic Scholar Public API under the ODC-By 1.0 license. By using S2ORC, you are agreeing to its usage terms.

Citation

If using this dataset, please cite:

@inproceedings{lo-wang-2020-s2orc,
    title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
    author = "Lo, Kyle  and Wang, Lucy Lu  and Neumann, Mark  and Kinney, Rodney  and Weld, Daniel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.447",
    doi = "10.18653/v1/2020.acl-main.447",
    pages = "4969--4983"
}