allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Bug: Unexpected behaviour: most "citations" fields empty via API batch endpoint #199

timwoelfle closed 4 months ago

timwoelfle commented 5 months ago

Dear Semantic Scholar team

I'm an academic using Semantic Scholar personally and for my free and open source tool "Local Citation Network" (https://localcitationnetwork.github.io/). I've noticed some weird behaviour with the API batch endpoint which I believe may be a bug.

Now here's the issue: Nearly all articles in the response have an empty array in the "citations" field, even though their "citationCount" numbers are positive:

> response.map(x => x.citations.length)
Array(55) [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ]
> response.map(x => x.citationCount)
Array(55) [ 148, 375, 86, 18, 471, 142, 56, 73, 148, 280, … ]

Only 2 entries (indices 27 and 42) actually had citations, and interestingly quite a lot of them: 8667 and 1332, respectively (far more than the "up to 1000 will be returned" mentioned in the documentation at https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/post_graph_get_papers).

I believe this is a bug. I've attached two response jsons (I tried it twice and it was reproducible). The request details of the API call are below.

I hope you can reproduce, track down, and fix this bug. Let me know if you have any questions! Thanks and best,

Tim Woelfle

Request header: POST https://api.semanticscholar.org/graph/v1/paper/batch?fields=title,venue,year,externalIds,abstract,referenceCount,citationCount,publicationTypes,publicationDate,journal,authors.externalIds,authors.name,authors.affiliations,references.paperId,tldr,citations.paperId

Request body (list of 55 ids): {"ids":["5c6a907a418896b8aee17663e8c87895c1622fd3","f7014c1b0b2e820ba82a017924590f3098b49910","0e56e9006d1a992de243e129025a000f3bc791a4","434ca529b68aabfb4835ac2cb8a8a3da6f83efe1","2dba24d0ae646a9562d1bdef3b2605325e65dc0f","14a62330576422c5e984be619299206110bacefb","ac2f7ce4fd521c11d3654b85839b96ef41a0f287","88b80d9466a4fb941c2b5b463dba1e2a4f23ebf4","505e022f19daaf96a59040e72c7194599c219af7","296bc78c86d17481e9b8983632773f3c5666b2af","dcb70f058a5db720462641b5090235b66cbb18ae","c2dae083b5d082978b1994dc79c19d32a0b3274a","a4b6e12005d58e512712405b351ae128b5f9300f","0b999eb051fdefda6ec5efae076dad7d138287c8","d3d718f3f0e4e6d91b3b13524b3e90496e76f841","67ad40cb40d7784d5543bf6166b55ef2dfb37ea1","c9b832926aef3e37c81fc5f1ced7e853a6cae6a1","12334afe89c06c07a1409d4442a1d51c26e10d93","d4a22bb96196c2e2df704a162522c53678091bb3","c706f5b8184a145e4f9d6ffbd62f6757c3badc3e","5cdc695ab97a720e468d28868528c785fbd8a114","9e2b5146d43268cde0a223c4ebafead8b63d7528","ed8713ca0d4e263cbb12c0da16fe56d6abc732fd","6c69a425959a4e98df944c10300258f18119c3b7","e2e16f3c123850dffbb38765ef8fd71ebaecdaca","e43256238dbfcf1fe37aac918a6d2d033e22d380","922cd02a5e4f1298384cb5b9f6d13df5daf64b70","467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e","5595d6e87417ba69831cd6da96e063b0a7ea373b","234590c3c737fe38ab3632f4a86a195462c547a7","963c95a977e4ce253791a7683ee19d91514a2002","04f4c68fa7bc5c9ea550076bb911b68b052d28a7","9758a5cb826ed7199ee8822f08108fd6bbf7a106","a010b76e1a809da5a128def075a310b1b1511593","045555ec4342da07074949f540bc615cb8c453cf","0de477d496b226525e56d2e6591a7721697dc2a8","c5e8862bfd224b8078a77655602c910963df75d3","be3eda717b99731f93de80d75031f38e40f84cee","4e16328e599e9d3169f40b6dbfbd039b4ca673a2","4d41108590a7823ea9b943bd4c614534edba3b8d","df6ade47d3bbab757e8fcf6b3f026b7d3d44ed01","37a09c5884e85fb6481e8bbd06724fa5ab293a39","389ec712b590cba24a184aa9704bcfad0970f1b0","706a40b0d9e7c046fa206124b78f25117f3e86af","70fae121f412c19612115eff06c13134b8cb2060","2b0dd59254ed9d1255f817e427ede2c9f53e5e5f","7411530aa26843b62f9174fa9d004bab72e476dc","8766710c66d2a93541f61003d2d2562573636f2d","fedd542a6c24f5dc2fc3b5cf8391326a605ccf85","df45f6e2d2e3a4abac857b914cae703f225957a0","5c2ae3bf77fbca2feb457e60861232af41b44403","53564c45fe0af4889c92f05b04626f7ac739a97a","d5237678e6d12e95bf989f7972fc065cc3800d55","ccb5d68fc4aef32b84fcaf409b0b672c46a2bd51","585bf445ec84c1d9621b2726bdcce9f544b515c8"]}

API calls were performed via fetch on Firefox 126 on Ubuntu.

Attachments: 24-06-01-S2-response-Boulton-2021-rerun.json, 24-06-01-S2-response-Boulton-2021.json

timwoelfle commented 5 months ago

Here's reproducible code in Python:

import requests

api_url = 'https://api.semanticscholar.org/graph/v1/paper/batch?fields=title,venue,year,externalIds,abstract,referenceCount,citationCount,publicationTypes,publicationDate,journal,authors.externalIds,authors.name,authors.affiliations,references.paperId,tldr,citations.paperId'
payload = {"ids": ["5c6a907a418896b8aee17663e8c87895c1622fd3", "f7014c1b0b2e820ba82a017924590f3098b49910", "0e56e9006d1a992de243e129025a000f3bc791a4", "434ca529b68aabfb4835ac2cb8a8a3da6f83efe1", "2dba24d0ae646a9562d1bdef3b2605325e65dc0f", "14a62330576422c5e984be619299206110bacefb", "ac2f7ce4fd521c11d3654b85839b96ef41a0f287", "88b80d9466a4fb941c2b5b463dba1e2a4f23ebf4", "505e022f19daaf96a59040e72c7194599c219af7", "296bc78c86d17481e9b8983632773f3c5666b2af", "dcb70f058a5db720462641b5090235b66cbb18ae", "c2dae083b5d082978b1994dc79c19d32a0b3274a", "a4b6e12005d58e512712405b351ae128b5f9300f", "0b999eb051fdefda6ec5efae076dad7d138287c8", "d3d718f3f0e4e6d91b3b13524b3e90496e76f841", "67ad40cb40d7784d5543bf6166b55ef2dfb37ea1", "c9b832926aef3e37c81fc5f1ced7e853a6cae6a1", "12334afe89c06c07a1409d4442a1d51c26e10d93", "d4a22bb96196c2e2df704a162522c53678091bb3", "c706f5b8184a145e4f9d6ffbd62f6757c3badc3e", "5cdc695ab97a720e468d28868528c785fbd8a114", "9e2b5146d43268cde0a223c4ebafead8b63d7528", "ed8713ca0d4e263cbb12c0da16fe56d6abc732fd", "6c69a425959a4e98df944c10300258f18119c3b7", "e2e16f3c123850dffbb38765ef8fd71ebaecdaca", "e43256238dbfcf1fe37aac918a6d2d033e22d380", "922cd02a5e4f1298384cb5b9f6d13df5daf64b70", "467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e", "5595d6e87417ba69831cd6da96e063b0a7ea373b", "234590c3c737fe38ab3632f4a86a195462c547a7", "963c95a977e4ce253791a7683ee19d91514a2002", "04f4c68fa7bc5c9ea550076bb911b68b052d28a7", "9758a5cb826ed7199ee8822f08108fd6bbf7a106", "a010b76e1a809da5a128def075a310b1b1511593", "045555ec4342da07074949f540bc615cb8c453cf", "0de477d496b226525e56d2e6591a7721697dc2a8", "c5e8862bfd224b8078a77655602c910963df75d3", "be3eda717b99731f93de80d75031f38e40f84cee", "4e16328e599e9d3169f40b6dbfbd039b4ca673a2", "4d41108590a7823ea9b943bd4c614534edba3b8d", "df6ade47d3bbab757e8fcf6b3f026b7d3d44ed01", "37a09c5884e85fb6481e8bbd06724fa5ab293a39", "389ec712b590cba24a184aa9704bcfad0970f1b0", "706a40b0d9e7c046fa206124b78f25117f3e86af", "70fae121f412c19612115eff06c13134b8cb2060", 
"2b0dd59254ed9d1255f817e427ede2c9f53e5e5f", "7411530aa26843b62f9174fa9d004bab72e476dc", "8766710c66d2a93541f61003d2d2562573636f2d", "fedd542a6c24f5dc2fc3b5cf8391326a605ccf85", "df45f6e2d2e3a4abac857b914cae703f225957a0", "5c2ae3bf77fbca2feb457e60861232af41b44403", "53564c45fe0af4889c92f05b04626f7ac739a97a", "d5237678e6d12e95bf989f7972fc065cc3800d55", "ccb5d68fc4aef32b84fcaf409b0b672c46a2bd51", "585bf445ec84c1d9621b2726bdcce9f544b515c8"]}

# Make the API request
response = requests.post(api_url, json=payload)

# Check the response
if response.status_code == 200:
    data = response.json()
    # Ensure correct parsing of the API response
    if len(data):
        print("len(citations):", [len(item.get('citations', [])) for item in data])
        print("citationCount:", [item.get('citationCount', 0) for item in data])
    else:
        print("No data found in the response.")
else:
    print(f"Error: {response.status_code} - {response.text}")

Output:

len(citations): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8675, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1324, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
citationCount: [148, 376, 86, 19, 473, 142, 56, 73, 148, 280, 289, 61, 361, 144, 24, 62, 101, 185, 163, 207, 179, 31, 3411, 28, 508, 867, 609, 8675, 411, 49, 405, 272, 699, 982, 69, 80, 3599, 190, 671, 821, 1214, 391, 3121, 1628, 247, 1205, 9033, 5, 3935, 3488, 521, 332, 2645, 664, 2308]

I would expect these numbers to match, right?
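The comparison above can also be written as a small helper that lists exactly which entries come back short. This is only a minimal sketch: the function name `find_truncated` and the sample data are hypothetical, and skipping `None` entries assumes the batch endpoint may return null for ids it cannot resolve.

```python
from typing import Any, Optional


def find_truncated(papers: list[Optional[dict[str, Any]]]) -> list[tuple[int, int, int]]:
    """Return (index, returned, expected) for every paper whose
    'citations' list is shorter than its 'citationCount'."""
    mismatches = []
    for index, paper in enumerate(papers):
        if paper is None:
            # Assumption: unresolved ids may appear as null entries.
            continue
        returned = len(paper.get("citations") or [])
        expected = paper.get("citationCount") or 0
        if returned < expected:
            mismatches.append((index, returned, expected))
    return mismatches


# Hypothetical sample shaped like the batch response in this report.
sample = [
    {"citationCount": 148, "citations": []},
    {"citationCount": 2, "citations": [{"paperId": "a"}, {"paperId": "b"}]},
    None,
]
truncated = find_truncated(sample)
print(truncated)
```

Passing `response.json()` from the script above instead of `sample` would list every affected index together with the returned and expected counts.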

cfiorelli commented 4 months ago

Thank you @timwoelfle for the detailed report. I've tested and escalated this to the appropriate team.

cfiorelli commented 4 months ago

@timwoelfle I've found that we have an undocumented limit of 9999 citation results per request. In your example, the system returned all 8675 citations for one paper and 1324 citations for the other, which together reach that limit (8675 + 1324 = 9999). The remaining citations could not be returned. This limit currently cannot be increased. The documentation is being updated.

Thanks so much for catching this and bringing it to our attention.
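Given the limit described above, one possible client-side workaround is to fetch `citationCount` first (without the `citations` field) and then split the ids into batches whose summed counts stay within the limit. The sketch below is an illustration, not an official recommendation: `plan_batches`, `CITATION_BUDGET`, and the sample ids are hypothetical, and it assumes the 9999 limit applies to the total number of citations across one request, as the reply suggests.

```python
CITATION_BUDGET = 9999  # limit reported in this thread; assumed to be a per-request total


def plan_batches(citation_counts: dict[str, int], budget: int = CITATION_BUDGET):
    """Greedily group paper ids so each batch's summed citationCount stays
    within the budget. Papers that exceed the budget on their own are
    returned separately, since no batch could hold them."""
    batches, oversized = [], []
    current, used = [], 0
    for paper_id, count in citation_counts.items():
        if count > budget:
            oversized.append(paper_id)
            continue
        if current and used + count > budget:
            # Adding this paper would exceed the budget: start a new batch.
            batches.append(current)
            current, used = [], 0
        current.append(paper_id)
        used += count
    if current:
        batches.append(current)
    return batches, oversized


# Hypothetical ids; the counts are in the range seen in the report.
counts = {"a": 148, "b": 8675, "c": 1324, "d": 3411, "e": 12000}
batches, oversized = plan_batches(counts)
print(batches, oversized)
```

Each batch can then be posted separately with `citations.paperId` in `fields`. Papers listed in `oversized` would need a different approach, such as a paginated per-paper citations request. Because citation counts change over time, the result is still worth checking against `citationCount` afterwards.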