allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Q: Hands-on example how to download the dataset? #110

Closed sorenwacker closed 11 months ago

sorenwacker commented 1 year ago

All I have found so far is the API description, which is rather technical. Is there an example somewhere of how to download the entire dataset, e.g. with Python?

sorenwacker commented 1 year ago

Also, some of the examples given here do not return the described results: https://api.semanticscholar.org/api-docs#tag/Paper-Data/operation/post_graph_get_papers

sorenwacker commented 1 year ago

This one returns {"message":"Too Many Requests"}:

https://api.semanticscholar.org/graph/v1/paper/search?query=covid+vaccination&offset=100&limit=3

sorenwacker commented 1 year ago

I tried to download the corpus with the AWS CLI, but that did not work:

aws s3 cp s3://ai2-s2-research-public/open-corpus/2023-05-23/ test --no-sign-request
fatal error: An error occurred (404) when calling the HeadObject operation: Key "open-corpus/2023-05-23/" does not exist

sorenwacker commented 1 year ago

This website (https://api.semanticscholar.org/api-docs/graph) mentions the need for an API key, but does not explain how to obtain one:


Academic Graph API (1.0)
Download OpenAPI specification:[Download](https://api.semanticscholar.org/graph/v1/swagger.json)

Fetch paper and author data from the Semantic Scholar Academic Graph (S2AG).

Some things to note:

- If you are using an API key, it must be set in the header x-api-key (case-sensitive).
- We have two different IDs for a single paper:
  - paperId - string - The primary way to identify papers when using our website or this API
  - corpusId - int64 - A second way to identify papers. Our datasets use corpusId when pointing to papers.
Other useful resources
[Overview](https://www.semanticscholar.org/product/api)
[allenai/s2-folks](https://github.com/allenai/s2-folks/)
[FAQ](https://github.com/allenai/s2-folks/blob/main/FAQ.md) in allenai/s2-folks

sorenwacker commented 1 year ago

Here is the most convenient way I have found to download the dataset (at least I think it is the correct dataset), using the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("allenai/s2orc")

yvonne-chou commented 1 year ago

This one returns {"message":"Too Many Requests"}:

https://api.semanticscholar.org/graph/v1/paper/search?query=covid+vaccination&offset=100&limit=3

You are hitting our rate limit for unauthenticated users. You can request an API key from this page: https://www.semanticscholar.org/product/api
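
For readers who hit the same error, here is a minimal sketch of how a client might send the key and back off when it still receives HTTP 429. It is an illustration, not an official client: the function names (`build_search_path`, `build_headers`, `http_get`, `search_papers`), the retry count, and the delays are all assumptions. Only the endpoint URL and the `x-api-key` header name come from this thread. The standard-library `http.client` module is used because it sends header names exactly as written.

```python
import http.client
import json
import time
import urllib.parse

HOST = "api.semanticscholar.org"
SEARCH_PATH = "/graph/v1/paper/search"


def build_search_path(query, offset=0, limit=3):
    """Return the request path and query string for a paper search."""
    params = urllib.parse.urlencode({"query": query, "offset": offset, "limit": limit})
    return f"{SEARCH_PATH}?{params}"


def build_headers(api_key=None):
    """Return request headers; the key goes in the lowercase x-api-key header."""
    return {"x-api-key": api_key} if api_key else {}


def http_get(path, headers):
    """Perform one HTTPS GET and return (status_code, decoded_body)."""
    conn = http.client.HTTPSConnection(HOST, timeout=30)
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()


def search_papers(query, offset=0, limit=3, api_key=None,
                  max_retries=5, base_delay=1.0, get=http_get, sleep=time.sleep):
    """Search for papers, retrying with exponential backoff on HTTP 429.

    max_retries and base_delay are illustrative defaults, not documented limits.
    """
    path = build_search_path(query, offset, limit)
    headers = build_headers(api_key)
    for attempt in range(max_retries):
        status, body = get(path, headers)
        if status == 200:
            return json.loads(body)
        # Give up on any non-429 error, or on 429 after the last attempt.
        if status != 429 or attempt == max_retries - 1:
            raise RuntimeError(f"HTTP {status}: {body}")
        sleep(base_delay * 2 ** attempt)
    raise ValueError("max_retries must be at least 1")
```

Calling `search_papers("covid vaccination", offset=100, limit=3, api_key=...)` requests the same URL as the one quoted above. The `get` and `sleep` parameters exist only so the retry logic can be exercised without network access.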

yvonne-chou commented 1 year ago

All I have found so far is the API description, which is rather technical. Is there an example somewhere of how to download the entire dataset, e.g. with Python?

Here's the direct link to the documentation for our Datasets API: https://api.semanticscholar.org/api-docs/datasets
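
Since the original question asked for a hands-on Python example, the following is a rough sketch of what a download script against the Datasets API could look like. It rests on several assumptions that should be checked against the documentation linked above: that the file list is served at `/datasets/v1/release/{release_id}/dataset/{dataset_name}`, that `latest` is accepted as a release ID, that the JSON response carries a `files` list of pre-signed download URLs, and that `s2orc` is a valid dataset name. The helper names (`dataset_url`, `local_name`, `list_files`, `download_dataset`) are hypothetical.

```python
import http.client
import json
import os
import shutil
import urllib.parse
import urllib.request

# Assumed base URL of the Datasets API; verify against the linked documentation.
BASE_URL = "https://api.semanticscholar.org/datasets/v1"


def dataset_url(dataset_name, release_id="latest"):
    """Return the (assumed) metadata URL for one dataset in one release."""
    return f"{BASE_URL}/release/{release_id}/dataset/{dataset_name}"


def local_name(file_url):
    """Derive a local file name from a download URL, dropping any query string."""
    return os.path.basename(urllib.parse.urlparse(file_url).path)


def list_files(dataset_name, api_key, release_id="latest"):
    """Fetch the list of download URLs; assumes the response has a 'files' field."""
    parts = urllib.parse.urlparse(dataset_url(dataset_name, release_id))
    conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    try:
        conn.request("GET", parts.path, headers={"x-api-key": api_key})
        response = conn.getresponse()
        body = response.read().decode("utf-8")
        if response.status != 200:
            raise RuntimeError(f"HTTP {response.status}: {body}")
        return json.loads(body)["files"]
    finally:
        conn.close()


def download_dataset(dataset_name, api_key, target_dir, release_id="latest"):
    """Download every file of the dataset into target_dir, one after another."""
    os.makedirs(target_dir, exist_ok=True)
    for file_url in list_files(dataset_name, api_key, release_id):
        destination = os.path.join(target_dir, local_name(file_url))
        # Stream to disk so large files are never held in memory.
        with urllib.request.urlopen(file_url) as response, open(destination, "wb") as out:
            shutil.copyfileobj(response, out)
```

With a valid key, `download_dataset("s2orc", api_key, "s2orc_data")` would fetch the files sequentially. The full corpus is large, so trying a smaller dataset first is probably wise.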

antonkulaga commented 11 months ago

I requested an API key two weeks ago, but nobody answered. I just want to download the dataset, and it seems to be open access anyway. If you are worried about paying Amazon for the traffic, maybe you could just release it via torrents?

Jgordo72 commented 11 months ago

Hi Anton, we sent your key on June 27th. I am resending it via email, so please check your junk mail folder if you do not receive it.

antonkulaga commented 11 months ago

@Jgordo72 thank you very much. I got the key yesterday and am downloading S2ORC now.