BAIs get incorrectly added to literature records from author.xml

michamos commented 3 months ago

The code that extracts identifiers from author.xml files trusts every identifier it finds and tries to add it to the author, which causes a validation error further down the line if the identifier type is unknown. This is fine for identifiers like ORCID, but not for INSPIRE BAIs: they have been removed from literature records and are now supposed to be generated dynamically from the linked author record during serialization: https://github.com/inspirehep/inspirehep/blob/4d514d4a046819ef984defc0435c413f3d90ce10/backend/inspirehep/records/marshmallow/literature/common/author.py#L62-L78.

The consequence is that we have hardcoded BAIs in literature records, and they get out of sync with the BAI of the linked author record whenever that BAI changes. Example: https://inspirehep.net/literature?sort=mostrecent&size=25&page=1&q=a%20Michele%20Selvaggi%20and%20a%20M.Selvaggi.1. These records should all have the BAI Michele.Selvaggi.1 instead of M.Selvaggi.1, but they don't because of the hardcoding.

We should fix the bug and run the script in https://github.com/inspirehep/curation-scripts/blob/master/scripts/remove-bai-from-lit-authors/script.py again to fix existing records.
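For illustration, the cleanup of existing records could look roughly like the sketch below. This is not the linked curation script. It assumes that each literature author carries an `ids` list of `{"schema": ..., "value": ...}` entries and that BAIs use the schema name `INSPIRE BAI`; the function name `strip_bais` is hypothetical.

```python
import copy

# Assumption: BAIs are stored on literature authors under this schema name.
BAI_SCHEMA = "INSPIRE BAI"


def strip_bais(record):
    """Return a copy of a literature record with hardcoded BAIs removed.

    Hypothetical helper: other identifiers (e.g. ORCID) are kept, and an
    author's ``ids`` key is dropped entirely if nothing is left in it.
    """
    cleaned = copy.deepcopy(record)
    for author in cleaned.get("authors", []):
        remaining = [
            identifier
            for identifier in author.get("ids", [])
            if identifier.get("schema") != BAI_SCHEMA
        ]
        if remaining:
            author["ids"] = remaining
        else:
            author.pop("ids", None)
    return cleaned
```

Working on a deep copy leaves the input untouched, so the original and cleaned records can be compared before anything is saved.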

drjova commented 3 months ago

@michamos I still don't understand what we should do with this. Do we have to keep the INSPIRE BAIs instead of removing them?

michamos commented 3 months ago

No, we should ignore the INSPIRE BAIs coming from author.xml.
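A minimal sketch of what ignoring them could mean at extraction time, assuming the extracted identifiers are available as a list of `{"schema": ..., "value": ...}` dicts and that BAIs carry the schema name `INSPIRE BAI`. The names `IGNORED_SCHEMAS` and `filter_author_ids` are hypothetical and do not come from the actual extraction code.

```python
# Assumption: schema names that must never be copied from author.xml
# onto a literature author, because they are derived from the linked
# author record at serialization time.
IGNORED_SCHEMAS = frozenset({"INSPIRE BAI"})


def filter_author_ids(extracted_ids):
    """Drop identifiers whose schema should not be trusted from author.xml.

    Hypothetical helper: keeps the original order and leaves every other
    identifier (e.g. ORCID) untouched.
    """
    return [
        identifier
        for identifier in extracted_ids
        if identifier.get("schema") not in IGNORED_SCHEMAS
    ]
```

Filtering with an explicit set of ignored schemas keeps the current behaviour for every other identifier type, so only BAIs are affected.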

drjova commented 3 months ago

Thanks, could you please provide an author.xml with BAIs?

michamos commented 3 months ago

Hmmm, I can't find any examples. So I don't understand where the hardcoded BAIs are coming from. I assumed author.xml files but that doesn't seem to be the case. Let's put this on hold, I'll run the cleanup script again and we'll see if the issue happens again.

cern-sis / issues-inspire

BAIs get incorrectly added to literature records from author.xml #460