Open michamos opened 3 months ago
@michamos I still don't understand what we should do with this: do we have to keep the INSPIRE BAIs instead of removing them?
No, we should ignore the INSPIRE BAIs coming from author.xml
Thanks, could you please provide an author.xml with BAIs?
Hmmm, I can't find any examples. So I don't understand where the hardcoded BAIs are coming from. I assumed author.xml files but that doesn't seem to be the case. Let's put this on hold, I'll run the cleanup script again and we'll see if the issue happens again.
The code extracting identifiers from author.xml files blindly trusts all identifiers and tries to add them to the author (causing a validation error if it's an unknown id later down the line). This is fine for things like ORCID, but not for INSPIRE BAIs, as they have been removed from literature records and are now supposed to be generated dynamically from the linked author record during serialization: https://github.com/inspirehep/inspirehep/blob/4d514d4a046819ef984defc0435c413f3d90ce10/backend/inspirehep/records/marshmallow/literature/common/author.py#L62-L78.
The consequence is that we have hardcoded BAIs in literature records, which get out of sync with the linked author BAI in case the BAI has changed. Example: https://inspirehep.net/literature?sort=mostrecent&size=25&page=1&q=a%20Michele%20Selvaggi%20and%20a%20M.Selvaggi.1. These should all have BAI Michele.Selvaggi.1 instead of M.Selvaggi.1 but don't because of the hardcoding.
We should fix the bug and run the script in https://github.com/inspirehep/curation-scripts/blob/master/scripts/remove-bai-from-lit-authors/script.py again to fix existing records.
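A minimal sketch of the kind of filter the fix could apply to identifiers extracted from author.xml, not the actual INSPIRE code. The helper name `drop_ignored_ids`, the `IGNORED_SCHEMAS` constant, and the assumption that each identifier is a dict with `schema` and `value` keys are all hypothetical; the ORCID value is a placeholder.

```python
# Hypothetical helper: names and the {"schema", "value"} shape are assumptions,
# not taken from the inspirehep codebase.
IGNORED_SCHEMAS = {"INSPIRE BAI"}


def drop_ignored_ids(ids):
    """Return the identifiers whose schema is not in IGNORED_SCHEMAS."""
    return [id_ for id_ in ids if id_.get("schema") not in IGNORED_SCHEMAS]


# Identifiers as they might come out of an author.xml file.
extracted = [
    {"schema": "ORCID", "value": "0000-0000-0000-0000"},  # placeholder value
    {"schema": "INSPIRE BAI", "value": "M.Selvaggi.1"},
]

# Only the ORCID entry is kept; the BAI is left to be generated from the
# linked author record during serialization.
kept = drop_ignored_ids(extracted)
```

Filtering on the schema leaves ORCID and other identifiers untouched, so only the BAI handling changes.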