allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Searching by paperId returns a different paper #157

Closed PedroSena closed 8 months ago

PedroSena commented 1 year ago

Describe the bug

When I search for a specific paperId, a different paperId is returned.

To Reproduce

Simply run:

 curl -X POST https://api.semanticscholar.org/graph/v1/paper/batch\?fields\=paperId -d '{"ids": ["e87b03add4f6fc193054023b173624bd98566f64"]}'

Expected behavior

[{"paperId": "e87b03add4f6fc193054023b173624bd98566f64"}]

Current behavior

Instead of returning the paperId I requested, it returns a different one:

[{"paperId": "93d2312a1cade9041d78b28e519266b606b5b366"}]

Additional Context

I obtained the e87... paperId from Semantic Scholar, where it pointed to a related PubMed record:

"externalIds"=>{"DOI"=>"10.1080/09602011.2023.2255797", "PubMed"=>"37818746", "CorpusId"=>261926482, "PubMedCentral"=>"10581430"}

While 93d... returns:

 "externalIds": {"PubMedCentral": "10545072", "DOI": "10.1080/09602011.2023.2255797", "CorpusId": 264334050, "PubMed": "37866888"}, "corpusId": 264334050

Notice that the DOI is the same, but the CorpusId, PubMed, and PubMedCentral values differ.

I'd like to better understand if paperId is really a unique identifier for a publication and how to ensure that whenever I search for paper X I get paper X in return.

cfiorelli commented 1 year ago

This is due to paper merging. If you look up both IDs by entering semanticscholar.org/paper/[IDHERE], they redirect to the same paper. Hopefully this clarifies things a bit!

thank you

PedroSena commented 1 year ago

Hi @cfiorelli ,

Thanks for the clarification.

Is there a way to identify that a paper was merged with another (besides relying on this rerouting)? We are using the batch API and requesting multiple papers at once, so being able to identify those cases would be ideal for us.

Thanks

serenalotreck commented 1 year ago

I'd like to bump this question; I think I am experiencing this as well. I'm trying to get the publication years for a list of papers, and I can't match the results up with other information I have about those papers, because not all of the returned paperIds are the same as the ones I requested.

I'm not sure how I would be able to get around this; I have about 6,000 paperIds that no longer match up, so I can't reasonably plug them into the URL manually to check.

cfiorelli commented 8 months ago

@PedroSena - I don't currently have a way to identify paper merges. There might be a clever workaround, but Semantic Scholar does not directly support anything that does this.

@serenalotreck - I worked on some of these issues internally, and the expectation at the time was that the updates to clustering (rolled out over several months) would be drawing to a conclusion. We are now well past the delivery of those updates, so hopefully your count of 6k mismatched IDs is much lower.

At a high level, I think we operate on the principle that we try to minimize errors, but models at this scale force us to accept some tolerance.

Thank you all for your ongoing participation and usage of the API!
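
For the batch case discussed in this thread, one possible workaround is to compare each requested ID with the paperId that comes back. The sketch below is not an officially supported merge-detection feature. It assumes that the batch endpoint returns its results in the same order as the submitted ids, with null for IDs it cannot resolve; check that against the current API documentation before relying on it. The function names fetch_batch and find_remapped are hypothetical helpers introduced here for illustration.

```python
import json
import urllib.request

# Same endpoint and field selection as the curl command in the report.
BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch?fields=paperId"


def fetch_batch(requested_ids):
    """POST a list of paper IDs to the batch endpoint and return the parsed JSON."""
    body = json.dumps({"ids": requested_ids}).encode("utf-8")
    request = urllib.request.Request(
        BATCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_remapped(requested_ids, results):
    """Pair requested IDs with results by position.

    ASSUMPTION: results[i] is the answer for requested_ids[i], and an
    unresolved ID is represented by None (JSON null).

    Returns (remapped, missing):
      remapped: dict mapping a requested ID to the different paperId returned
      missing:  list of requested IDs that produced no result
    """
    if len(requested_ids) != len(results):
        raise ValueError("result count differs from request count; cannot pair by position")

    remapped = {}
    missing = []
    for requested, paper in zip(requested_ids, results):
        if paper is None:
            missing.append(requested)
        elif paper["paperId"] != requested:
            # Probably a merged paper: the old ID now resolves to another record.
            remapped[requested] = paper["paperId"]
    return remapped, missing
```

Usage would look like `remapped, missing = find_remapped(ids, fetch_batch(ids))`. The remapped dictionary can then be used to translate old IDs to the returned ones before joining against other data. For a list of several thousand IDs, the requests would need to be split into chunks that respect the endpoint's documented batch size limit.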