allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Bug: Unexpected behaviour: most "citations" fields empty via API batch endpoint #199

Open timwoelfle opened 1 month ago

timwoelfle commented 1 month ago

Dear Semantic Scholar team

I'm an academic using Semantic Scholar personally and for my free and open source tool "Local Citation Network" (https://localcitationnetwork.github.io/). I've noticed some weird behaviour with the API batch endpoint which I believe may be a bug.

Now here's the issue: Nearly all articles in the response have an empty array in the "citations" field, even though their "citationCount" numbers are positive:

> response.map(x => x.citations.length)
Array(55) [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ]
> response.map(x => x.citationCount)
Array(55) [ 148, 375, 86, 18, 471, 142, 56, 73, 148, 280, … ]

Only 2 entries (indices 27 and 42) actually had citations, and surprisingly many: 8667 and 1332, respectively. That is well above the "up to 1000 will be returned" limit mentioned in the documentation at https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/post_graph_get_papers.
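For comparison, the citations of a single paper could be paged through explicitly instead of relying on the batch response. This is only a sketch under assumptions: it assumes the per-paper endpoint `GET /graph/v1/paper/{paper_id}/citations` accepts `fields`, `offset` and `limit`, and returns a `data` list of `citingPaper` objects plus an optional `next` offset. The names `fetch_all_citations`, `http_get_json` and `get_json` are hypothetical helpers, not part of any Semantic Scholar client.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request
from typing import Any, Callable

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper"


def http_get_json(url: str) -> dict[str, Any]:
    """Fetch a URL and decode the JSON body (no retries, no API key)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_all_citations(
    paper_id: str,
    get_json: Callable[[str], dict[str, Any]] = http_get_json,
    page_size: int = 1000,
) -> list[str]:
    """Collect the paperIds of all citing papers, one page at a time.

    Assumption: each page looks like
    {"offset": ..., "next": ..., "data": [{"citingPaper": {"paperId": ...}}]}
    and "next" is absent on the last page.
    """
    citing_ids: list[str] = []
    offset: int | None = 0
    while offset is not None:
        query = urllib.parse.urlencode(
            {"fields": "paperId", "offset": offset, "limit": page_size}
        )
        page = get_json(f"{BASE_URL}/{paper_id}/citations?{query}")
        for entry in page.get("data", []):
            citing = entry.get("citingPaper") or {}
            if citing.get("paperId"):
                citing_ids.append(citing["paperId"])
        # Stop when the server no longer reports a following page.
        offset = page.get("next")
    return citing_ids
```

Calling `fetch_all_citations("467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e")` would issue one request per page. Passing a custom `get_json` makes it possible to add an API key, rate limiting, or a stub for testing.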

I believe this is a bug. I've attached two response jsons (I tried it twice and it was reproducible). The request details of the API call are below.

I hope you can reproduce, track down, and fix this bug. Let me know if you have any questions! Thanks and best,

Tim Woelfle

Request header: POST https://api.semanticscholar.org/graph/v1/paper/batch?fields=title,venue,year,externalIds,abstract,referenceCount,citationCount,publicationTypes,publicationDate,journal,authors.externalIds,authors.name,authors.affiliations,references.paperId,tldr,citations.paperId

Request body (list of 55 ids): {"ids":["5c6a907a418896b8aee17663e8c87895c1622fd3","f7014c1b0b2e820ba82a017924590f3098b49910","0e56e9006d1a992de243e129025a000f3bc791a4","434ca529b68aabfb4835ac2cb8a8a3da6f83efe1","2dba24d0ae646a9562d1bdef3b2605325e65dc0f","14a62330576422c5e984be619299206110bacefb","ac2f7ce4fd521c11d3654b85839b96ef41a0f287","88b80d9466a4fb941c2b5b463dba1e2a4f23ebf4","505e022f19daaf96a59040e72c7194599c219af7","296bc78c86d17481e9b8983632773f3c5666b2af","dcb70f058a5db720462641b5090235b66cbb18ae","c2dae083b5d082978b1994dc79c19d32a0b3274a","a4b6e12005d58e512712405b351ae128b5f9300f","0b999eb051fdefda6ec5efae076dad7d138287c8","d3d718f3f0e4e6d91b3b13524b3e90496e76f841","67ad40cb40d7784d5543bf6166b55ef2dfb37ea1","c9b832926aef3e37c81fc5f1ced7e853a6cae6a1","12334afe89c06c07a1409d4442a1d51c26e10d93","d4a22bb96196c2e2df704a162522c53678091bb3","c706f5b8184a145e4f9d6ffbd62f6757c3badc3e","5cdc695ab97a720e468d28868528c785fbd8a114","9e2b5146d43268cde0a223c4ebafead8b63d7528","ed8713ca0d4e263cbb12c0da16fe56d6abc732fd","6c69a425959a4e98df944c10300258f18119c3b7","e2e16f3c123850dffbb38765ef8fd71ebaecdaca","e43256238dbfcf1fe37aac918a6d2d033e22d380","922cd02a5e4f1298384cb5b9f6d13df5daf64b70","467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e","5595d6e87417ba69831cd6da96e063b0a7ea373b","234590c3c737fe38ab3632f4a86a195462c547a7","963c95a977e4ce253791a7683ee19d91514a2002","04f4c68fa7bc5c9ea550076bb911b68b052d28a7","9758a5cb826ed7199ee8822f08108fd6bbf7a106","a010b76e1a809da5a128def075a310b1b1511593","045555ec4342da07074949f540bc615cb8c453cf","0de477d496b226525e56d2e6591a7721697dc2a8","c5e8862bfd224b8078a77655602c910963df75d3","be3eda717b99731f93de80d75031f38e40f84cee","4e16328e599e9d3169f40b6dbfbd039b4ca673a2","4d41108590a7823ea9b943bd4c614534edba3b8d","df6ade47d3bbab757e8fcf6b3f026b7d3d44ed01","37a09c5884e85fb6481e8bbd06724fa5ab293a39","389ec712b590cba24a184aa9704bcfad0970f1b0","706a40b0d9e7c046fa206124b78f25117f3e86af","70fae121f412c19612115eff06c13134b8cb2060","2b0dd59254ed9d1255f817e427ede2c9f53e5e5f","7411530aa26843b62f9174fa9d004bab72e476dc","8766710c66d2a93541f61003d2d2562573636f2d","fedd542a6c24f5dc2fc3b5cf8391326a605ccf85","df45f6e2d2e3a4abac857b914cae703f225957a0","5c2ae3bf77fbca2feb457e60861232af41b44403","53564c45fe0af4889c92f05b04626f7ac739a97a","d5237678e6d12e95bf989f7972fc065cc3800d55","ccb5d68fc4aef32b84fcaf409b0b672c46a2bd51","585bf445ec84c1d9621b2726bdcce9f544b515c8"]}

API calls were performed via fetch on Firefox 126 on Ubuntu.

Attachments: 24-06-01-S2-response-Boulton-2021.json, 24-06-01-S2-response-Boulton-2021-rerun.json

timwoelfle commented 3 weeks ago

Here's reproducible code in Python:

import requests

api_url = 'https://api.semanticscholar.org/graph/v1/paper/batch?fields=title,venue,year,externalIds,abstract,referenceCount,citationCount,publicationTypes,publicationDate,journal,authors.externalIds,authors.name,authors.affiliations,references.paperId,tldr,citations.paperId'
payload = {"ids": ["5c6a907a418896b8aee17663e8c87895c1622fd3", "f7014c1b0b2e820ba82a017924590f3098b49910", "0e56e9006d1a992de243e129025a000f3bc791a4", "434ca529b68aabfb4835ac2cb8a8a3da6f83efe1", "2dba24d0ae646a9562d1bdef3b2605325e65dc0f", "14a62330576422c5e984be619299206110bacefb", "ac2f7ce4fd521c11d3654b85839b96ef41a0f287", "88b80d9466a4fb941c2b5b463dba1e2a4f23ebf4", "505e022f19daaf96a59040e72c7194599c219af7", "296bc78c86d17481e9b8983632773f3c5666b2af", "dcb70f058a5db720462641b5090235b66cbb18ae", "c2dae083b5d082978b1994dc79c19d32a0b3274a", "a4b6e12005d58e512712405b351ae128b5f9300f", "0b999eb051fdefda6ec5efae076dad7d138287c8", "d3d718f3f0e4e6d91b3b13524b3e90496e76f841", "67ad40cb40d7784d5543bf6166b55ef2dfb37ea1", "c9b832926aef3e37c81fc5f1ced7e853a6cae6a1", "12334afe89c06c07a1409d4442a1d51c26e10d93", "d4a22bb96196c2e2df704a162522c53678091bb3", "c706f5b8184a145e4f9d6ffbd62f6757c3badc3e", "5cdc695ab97a720e468d28868528c785fbd8a114", "9e2b5146d43268cde0a223c4ebafead8b63d7528", "ed8713ca0d4e263cbb12c0da16fe56d6abc732fd", "6c69a425959a4e98df944c10300258f18119c3b7", "e2e16f3c123850dffbb38765ef8fd71ebaecdaca", "e43256238dbfcf1fe37aac918a6d2d033e22d380", "922cd02a5e4f1298384cb5b9f6d13df5daf64b70", "467f0fdc420f5cd8996c0b2b1eb33a3dcda93c5e", "5595d6e87417ba69831cd6da96e063b0a7ea373b", "234590c3c737fe38ab3632f4a86a195462c547a7", "963c95a977e4ce253791a7683ee19d91514a2002", "04f4c68fa7bc5c9ea550076bb911b68b052d28a7", "9758a5cb826ed7199ee8822f08108fd6bbf7a106", "a010b76e1a809da5a128def075a310b1b1511593", "045555ec4342da07074949f540bc615cb8c453cf", "0de477d496b226525e56d2e6591a7721697dc2a8", "c5e8862bfd224b8078a77655602c910963df75d3", "be3eda717b99731f93de80d75031f38e40f84cee", "4e16328e599e9d3169f40b6dbfbd039b4ca673a2", "4d41108590a7823ea9b943bd4c614534edba3b8d", "df6ade47d3bbab757e8fcf6b3f026b7d3d44ed01", "37a09c5884e85fb6481e8bbd06724fa5ab293a39", "389ec712b590cba24a184aa9704bcfad0970f1b0", "706a40b0d9e7c046fa206124b78f25117f3e86af", "70fae121f412c19612115eff06c13134b8cb2060", 
"2b0dd59254ed9d1255f817e427ede2c9f53e5e5f", "7411530aa26843b62f9174fa9d004bab72e476dc", "8766710c66d2a93541f61003d2d2562573636f2d", "fedd542a6c24f5dc2fc3b5cf8391326a605ccf85", "df45f6e2d2e3a4abac857b914cae703f225957a0", "5c2ae3bf77fbca2feb457e60861232af41b44403", "53564c45fe0af4889c92f05b04626f7ac739a97a", "d5237678e6d12e95bf989f7972fc065cc3800d55", "ccb5d68fc4aef32b84fcaf409b0b672c46a2bd51", "585bf445ec84c1d9621b2726bdcce9f544b515c8"]}

# Make the API request
response = requests.post(api_url, json=payload)

# Check the response
if response.status_code == 200:
    data = response.json()
    # Only print the comparison if the response contains at least one paper
    if data:
        print("len(citations):", [len(item.get('citations', [])) for item in data])
        print("citationCount:", [item.get('citationCount', 0) for item in data])
    else:
        print("No data found in the response.")
else:
    print(f"Error: {response.status_code} - {response.text}")

Output:

len(citations): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8675, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1324, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
citationCount: [148, 376, 86, 19, 473, 142, 56, 73, 148, 280, 289, 61, 361, 144, 24, 62, 101, 185, 163, 207, 179, 31, 3411, 28, 508, 867, 609, 8675, 411, 49, 405, 272, 699, 982, 69, 80, 3599, 190, 671, 821, 1214, 391, 3121, 1628, 247, 1205, 9033, 5, 3935, 3488, 521, 332, 2645, 664, 2308]

I would expect these numbers to match, right?
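To make the discrepancy easier to inspect, the two lists above can be reduced to just the entries that disagree. The following is a minimal sketch that works on the parsed batch response (`data` in the script above). `find_citation_mismatches` is a hypothetical helper name, and the handling of `None` entries is an assumption about how unresolved ids might be returned.

```python
from __future__ import annotations

from typing import Any


def find_citation_mismatches(
    papers: list[dict[str, Any] | None],
) -> list[dict[str, Any]]:
    """Return one record per paper whose citations list length differs from citationCount."""
    mismatches = []
    for index, paper in enumerate(papers):
        if paper is None:  # assumption: unresolved ids may come back as null
            continue
        returned = len(paper.get("citations") or [])
        expected = paper.get("citationCount") or 0
        if returned != expected:
            mismatches.append(
                {
                    "index": index,
                    "paperId": paper.get("paperId"),
                    "returned": returned,
                    "citationCount": expected,
                }
            )
    return mismatches


# Tiny illustrative input (made-up ids, not taken from the real response)
sample = [
    {"paperId": "a", "citationCount": 148, "citations": []},
    {"paperId": "b", "citationCount": 2, "citations": [{"paperId": "x"}, {"paperId": "y"}]},
    None,
]
print(find_citation_mismatches(sample))
```

Running `find_citation_mismatches(data)` on the real response should list every index where the "citations" array is shorter (or longer) than "citationCount" suggests.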

cfiorelli commented 3 days ago

Thank you @timwoelfle for the detailed report. I've tested and escalated this to the appropriate team.