allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

s2.api.get_paper is much slower than the rate limit #124

Closed rmovva closed 1 year ago

rmovva commented 1 year ago

Describe the bug

I am using s2.api.get_paper to retrieve paper info for ~500K arXiv IDs. I received an API key earlier today, but when I pass the key as an argument with api_key=S2_API_KEY, I cannot retrieve papers at my assigned rate limit of 100 requests/second (I still seem to be limited to the default public request rate).

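To Reproduce

For example:

import s2
import tqdm as tq

for arxiv_id in tq.tqdm(arxiv_ids, miniters=100):
    try:
        paper = s2.api.get_paper(
            paperId=arxiv_id,
            params=dict(include_unknown_references=True),
            retries=2,
            wait=1,
            api_key=S2_API_KEY,
        )
    except Exception as e:
        # The snippet as posted ends inside the try block; this generic handler is assumed
        print(f"Failed to fetch {arxiv_id}: {e}")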

Expected behavior

I expected to retrieve papers at ~100/s, but the observed rate is closer to ~1/s.
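
For a sense of scale, the figures in this report can be turned into rough run times. The sketch below is plain arithmetic on the numbers quoted above (~500K IDs, ~1 request/s observed, 100 requests/s assigned); the helper name `estimated_hours` is made up for illustration, and it assumes one paper per request at a steady rate.

```python
def estimated_hours(n_requests: int, requests_per_second: float) -> float:
    """Rough wall-clock hours to issue n_requests at a steady request rate."""
    return n_requests / requests_per_second / 3600


N_IDS = 500_000  # "~500K arXiv IDs" from the report, one request per ID

print(f"at ~1 request/s:   {estimated_hours(N_IDS, 1):.1f} h")
print(f"at 100 requests/s: {estimated_hours(N_IDS, 100):.2f} h")
```

At the observed rate the full run takes days rather than the hour or two the assigned limit would allow, which is why the gap matters here.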

cfiorelli commented 1 year ago

@rmovva Thanks for reaching out! Take a look at how the API key header is set here, and let us know if you're still running into trouble.

import requests
import tqdm

# List of arXiv IDs
arxiv_ids = ["arXiv:1703.10593", "arXiv:2001.01489"]

# Define your API key from Semantic Scholar
S2_API_KEY = 'YOUR KEY HERE'

# Splitting the arxiv_ids into chunks of 500 due to the API limitation
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

BASE_URL = 'https://api.semanticscholar.org/graph/v1/paper/batch'
HEADERS = {
    'x-api-key': S2_API_KEY,
    'Content-Type': 'application/json'
}

# We'll loop through chunks of IDs and send batch requests
for chunk in tqdm.tqdm(list(chunks(arxiv_ids, 500))):
    response = requests.post(
        BASE_URL,
        headers=HEADERS,
        params={'fields': 'referenceCount,citationCount,title'},
        json={"ids": chunk}
    )

    if response.status_code == 200:
        papers = response.json()  # Assuming the response directly gives a list of papers
        for paper in papers:
            if paper is None:
                # Assumption: IDs the API cannot resolve come back as null entries
                continue
            # Do something with each paper's details here
            print(paper['title'], paper['referenceCount'])
    else:
        print(f"Failed to fetch details for chunk: {chunk}")
        print("Status Code:", response.status_code)
        print("Response:", response.text)
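
If a batch request is still rejected for exceeding the rate limit, one option is to retry it after a pause. The following is a minimal sketch, not part of the script above: `send_with_backoff` is a hypothetical helper, it assumes a rate-limited request is signalled with HTTP status 429, and the delay values are arbitrary.

```python
import time


def send_with_backoff(send, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute. Waits base_delay, 2*base_delay, 4*base_delay, ...
    between attempts and returns the last response received.
    """
    response = send()
    for attempt in range(max_retries):
        if response.status_code != 429:  # assumed "too many requests" status
            break
        sleep(base_delay * 2 ** attempt)
        response = send()
    return response
```

Inside the loop above, the call could be wrapped as `send_with_backoff(lambda: requests.post(BASE_URL, headers=HEADERS, params={'fields': 'referenceCount,citationCount,title'}, json={"ids": chunk}))`.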

rmovva commented 1 year ago

This worked (and ran very quickly), thanks!

Is there documentation somewhere on what attributes can be retrieved using the paper batch API? For example, I notice you have 'referenceCount,citationCount,title' here, but the S2Paper object has many other attributes: https://pys2.readthedocs.io/en/latest/api_reference/models/s2paper.html

However, attributes like citationVelocity don't seem to be available through these batch API calls -- do you know how I can get these other attributes / exactly which ones are available?
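
One way to experiment with other attributes is to make the `fields` list a parameter of the batch request. This is a minimal sketch that only builds the request arguments and makes no network call; `batch_requests` is a hypothetical helper, and the extra field names in the usage note are assumptions that should be checked against the Graph API documentation.

```python
from typing import Iterator, Sequence

# Batch endpoint URL as used in the script earlier in this thread
BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch"


def batch_requests(
    ids: Sequence[str], fields: Sequence[str], chunk_size: int = 500
) -> Iterator[dict]:
    """Yield keyword arguments for requests.post, one dict per chunk of IDs."""
    for i in range(0, len(ids), chunk_size):
        yield {
            "url": BATCH_URL,
            "params": {"fields": ",".join(fields)},
            "json": {"ids": list(ids[i:i + chunk_size])},
        }
```

Each yielded dict can be sent with `requests.post(headers=HEADERS, **kwargs)`, for example with `fields=["title", "year", "authors"]`; a field the endpoint does not support would then surface in the response status and body rather than silently.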