allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

S2 API giving wrong results for some paper IDs (returning entirely different paper IDs than input) #108

Closed Siddharth-Gandhi closed 1 year ago

Siddharth-Gandhi commented 1 year ago

Describe the bug

The batch API had been working fine since my last issue (#59). However, there is a strange bug: when I send a list X of paper IDs as the payload to the batch API (the same problem also exists in the normal S2 API), it returns an entirely different list Y of paper IDs, and none of the paper IDs in Y are in X. This does not happen for most paper IDs, only for some.

To Reproduce

import json
import requests

not_working_ids = ['3e83d54c5e8dfba82638b4f75ace31505ea60ff0', '9dd051e6f842131196fee5cbc79b8e4511d577c2', '817aa71dd75abc01dedb24f806d69e8e97828a11', '16c232a9310860be9e9817cca875cd72d9ba50d4', '468c3b2bf358d07cc625b075f91595d825299948', '022dd244f2e25525eb37e9dda51abb9cd8ca8c30', '0d684d919652ab2506fc8ef0a2494a46c3f7abca', '21b770571687a483672894374065b93e246fd200', 'b281a8a5f9af12143b0813ebe65eac3e9971f316', 'bd33916225d23a8855a1e67ae73321d7b70fcd0c', '7cccee8c8a3807b1699b1b82bdaa8e5e66eb5d0f', 'bc1586a2e74d6d1cf87b083c4cbd1eede2b09ea5', '6e0cfc8a2e743e3a90ad089f0fd4e4985f2f6834', '0aea520a25198f6b3f385a09b158da2f7ec5cf1f', '7c53d9c66a8648abb060318e36be4266233c4c0c', '6e45220c1f3a8a8cbf176a2fc722c7e8380d5dd4', '98485ce6532d69f34a8ec67de6b09a39532bd221', 'dfc504536e8434eb008680343abb77010965169e']

working_ids = ["204e3073870fae3d05bcbc2f6a8e263d9b72e776", "bee044c8e8903fb67523c1f8c105ab4718600cdb", "36eff562f65125511b5dfab68ce7f7a943c27478", "8388f1be26329fa45e5807e968a641ce170ea078", "846aedd869a00c09b40f1f1f35673cb22bc87490", "e0e9a94c4a6ba219e768b4e59f72c18f0a22e23d", "fa72afa9b2cbc8f0d7b05d52548906610ffbb9c5", "424561d8585ff8ebce7d5d07de8dbf7aae5e7270", "4d376d6978dad0374edfa6709c9556b42d3594d3", "a6cb366736791bcccc5c8639de5a8f9636bf87e8", "df2b0e26d0599ce3e70df8a9da02e51594e0e992", "913f54b44dfb9202955fe296cf5586e1105565ea", "156d217b0a911af97fa1b5a71dc909ccef7a8028", "a3e4ceb42cbcd2c807d53aff90a8cb1f5ee3f031", "5c5751d45e298cea054f32b392c12c61027d2fe7", "bc1586a2e74d6d1cf87b083c4cbd1eede2b09ea5", "921b2958cac4138d188fd5047aa12bbcf37ac867", "cb92a7f9d9dbcf9145e32fdfa0e70e2a6b828eb1"]

print(f"Number of IDs: {len(working_ids)}")
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'referenceCount,citationCount,title'},
    json={"ids": working_ids}
)
print(json.dumps(r.json(), indent=2))

This was the original code where I ran into the issue. Run the above code once with working_ids and once with not_working_ids. For working_ids, the returned papers are exactly the ones requested. For not_working_ids, however, an entirely different set of IDs is returned.

Minimal Example

Asking for one paper ID and getting a different one in return.

not_working_id = ['3e83d54c5e8dfba82638b4f75ace31505ea60ff0']
print(f"Number of IDs: {len(not_working_id)}")
r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'referenceCount,citationCount,title'},
    json={"ids": not_working_id}
)
print(json.dumps(r.json(), indent=2))

Expected behavior

I would expect the API to return the same IDs as the input for all paper IDs, rather than working for some and not for others. In the minimal example, the input paper ID is 3e83d54c5e8dfba82638b4f75ace31505ea60ff0 but the returned paper ID is ad21c3cd8871347e3bdb7cb2800049f7e8a97aca. If a paper ID doesn't exist, the API should return an error instead of an entirely different paper ID.

Screenshots image

Additional context

From the screenshot above, the problem seems to be an incorrect mapping of some paper IDs, perhaps because they were remapped in the past and the mapping wasn't updated properly.

aletar89 commented 1 year ago

I'm not from S2, but we encountered a similar issue in the past. Note that if you go to the S2 page of 3e83..., you get redirected to the S2 page of ad21.... I think this happens when S2 merges duplicates. A couple of years ago, the default behavior was to drop the old ID and return 404, which was much worse because you couldn't get to the paper at all. You can easily check whether the returned ID differs from the requested one and raise an error in your app if that is the behavior you want.
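A minimal sketch of that check, in case it helps. It assumes the batch endpoint returns a list aligned position by position with the submitted IDs, that each entry carries a paperId field, and that an ID the API cannot resolve comes back as null. The helper names find_remapped_ids and fetch_batch are made up for illustration and are not part of any S2 client.

```python
BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch"


def find_remapped_ids(requested_ids, papers):
    """Return {requested_id: returned_id} for every position where they differ.

    Assumption: `papers` is aligned with `requested_ids`, and an unresolved
    ID shows up as None (reported here with a value of None).
    """
    if len(requested_ids) != len(papers):
        raise ValueError("response length does not match request length")
    remapped = {}
    for requested, paper in zip(requested_ids, papers):
        returned = paper.get("paperId") if paper else None
        if returned != requested:
            remapped[requested] = returned
    return remapped


def fetch_batch(ids, fields="referenceCount,citationCount,title"):
    """POST the IDs to the batch endpoint and return the parsed JSON list."""
    # Imported here so the pure check above can be used without the dependency.
    import requests

    r = requests.post(
        BATCH_URL,
        params={"fields": fields},
        json={"ids": ids},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()


def fetch_batch_strict(ids):
    """Fetch papers and raise if any returned paperId differs from the input."""
    papers = fetch_batch(ids)
    remapped = find_remapped_ids(ids, papers)
    if remapped:
        raise LookupError(f"IDs were remapped or missing: {remapped}")
    return papers
```

Calling fetch_batch_strict(not_working_id) would then raise instead of silently returning a different paper, while find_remapped_ids on its own gives you the old-to-new mapping if you would rather follow the merge than reject it.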