adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+

Ensure metadata for non-canonical bibcodes is retained upon deletion #248

Open aaccomazzi opened 4 years ago

aaccomazzi commented 4 years ago

When we apply deletions, we remove records indexed under a bibcode that is no longer canonical but that should still be recognized by the system. The problem arises when a published bibcode is ingested before it is matched to an eprint. In that case, the ingest of the published bibcode does not pick up and merge the arXiv metadata (which includes its bibcode, arXiv id, keywords, etc.). Later, when the match occurs, the arXiv bibcode is no longer canonical and gets removed, but the publisher bibcode is not reingested automatically; that only happens on a weekly schedule, when the fingerprint for the published metadata is computed. Until then, the system does not recognize the arXiv identifiers.

An example may illustrate the problem better:

2020-01-20 - ingest: 2020arXiv200106997J (w/ arXiv metadata)
2020-03-01 - ingest: 2020ApJ...890..171J (w/ publisher metadata)
2020-03-02 - match:  2020arXiv200106997J & 2020ApJ...890..171J
2020-03-02 - delete: 2020arXiv200106997J (no longer canonical)
2020-03-08 - ingest: 2020ApJ...890..171J (w/ publisher + arXiv metadata)

I suggest we add a step to the deletion process which checks whether the bibcode of the record to be deleted is an alternate and, if so, forces the reingest of the corresponding published record. An alternative approach would be to check, prior to deletion, whether the canonical record's JSON fingerprint lists the arXiv id of the record about to be deleted among its alternates. This seems a more fragile solution, because it would only work for arXiv ids and not for other alternates.
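
To make the first suggestion concrete, here is a rough sketch of what the extra deletion step could look like. It is not the pipeline's actual code: `delete_records`, `canonical_of` and the in-memory `index` are hypothetical names made up for illustration, and the real implementation would look up alternates and queue reingests through whatever the pipeline already uses for that.

```python
from typing import Dict, List, Set


def delete_records(
    to_delete: List[str],
    canonical_of: Dict[str, str],
    index: Set[str],
) -> List[str]:
    """Delete bibcodes from the index and return the canonical bibcodes
    that should be reingested right away.

    canonical_of is a hypothetical lookup: alternate bibcode -> canonical bibcode.
    index is a stand-in for the set of currently indexed bibcodes.
    """
    reingest: List[str] = []
    for bibcode in to_delete:
        # Existing behaviour: drop the record that is no longer canonical.
        index.discard(bibcode)

        # Proposed extra step: if the deleted bibcode is an alternate of some
        # other record, force a reingest of that canonical record so it picks
        # up the alternate's metadata without waiting for the weekly run.
        canonical = canonical_of.get(bibcode)
        if canonical is not None and canonical != bibcode and canonical not in reingest:
            reingest.append(canonical)
    return reingest


# The example from this issue: the arXiv bibcode is deleted after the match,
# so the published bibcode is queued for reingest on the same day.
index = {"2020arXiv200106997J", "2020ApJ...890..171J"}
canonical_of = {"2020arXiv200106997J": "2020ApJ...890..171J"}
queue = delete_records(["2020arXiv200106997J"], canonical_of, index)
print(queue)
```

Because the check keys on the alternate-to-canonical mapping rather than on arXiv ids in the fingerprint, it would apply to any kind of alternate bibcode, which is the reason for preferring it over the second approach.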