Make-Data-Count-Community / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

Check for duplicates in event data import #33

Open lizkrznarich opened 3 months ago

lizkrznarich commented 3 months ago

There is code that attempts to exclude duplicate citations from being added to the assertions table. However, it is not clear that this code works as expected, since we had to manually remove duplicates from the assertions table as part of the v2.0 cleanup work.

For citations recently ingested from DataCite event data, check whether any duplicates were created. If so, adapt this code to handle duplicates properly so that we do not have to clean them out later: https://gitlab.coko.foundation/datacite/datacite/-/blob/main/packages/server/services/seedSource/dataCiteEventData.js?ref_type=heads#L107

Note that an event can be duplicated in DataCite Event Data many times, because new events are created for all related IDs every time a member adds or updates any related ID for a given DOI. That means the data returned by a single query to the Event Data API can include many instances of an event with the same subj_id, obj_id and relation_type_id (but with different timestamps).
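
To illustrate the kind of check described above, here is a minimal sketch of collapsing a batch of events on the (subj_id, obj_id, relation_type_id) key before anything is written to the assertions table. This is not the existing code in dataCiteEventData.js: the function name `dedupeEvents`, the flat event shape, the `timestamp` field name, and the choice to keep the most recent instance are all assumptions made for illustration.

```javascript
// Hypothetical helper (not part of the existing codebase): collapse events that
// share the same subj_id, obj_id and relation_type_id into a single event.
// Assumes each event is a flat object with those three fields plus an
// ISO 8601 `timestamp` string; the real API response shape may differ.
function dedupeEvents(events) {
  const latestByKey = new Map();

  for (const event of events) {
    // Serialising the three fields as a JSON array avoids accidental key
    // collisions that a plain delimiter (e.g. "|") could cause inside URLs.
    const key = JSON.stringify([
      event.subj_id,
      event.obj_id,
      event.relation_type_id,
    ]);

    const existing = latestByKey.get(key);
    // Keep the first instance seen, then replace it only if a later
    // instance has a strictly newer timestamp.
    if (
      existing === undefined ||
      Date.parse(event.timestamp) > Date.parse(existing.timestamp)
    ) {
      latestByKey.set(key, event);
    }
  }

  return Array.from(latestByKey.values());
}

// Made-up sample data: the first two events differ only by timestamp.
const sampleEvents = [
  {
    subj_id: "https://doi.org/10.1234/article-a",
    obj_id: "https://doi.org/10.5678/dataset-b",
    relation_type_id: "references",
    timestamp: "2024-01-01T00:00:00Z",
  },
  {
    subj_id: "https://doi.org/10.1234/article-a",
    obj_id: "https://doi.org/10.5678/dataset-b",
    relation_type_id: "references",
    timestamp: "2024-03-01T00:00:00Z",
  },
  {
    subj_id: "https://doi.org/10.1234/article-a",
    obj_id: "https://doi.org/10.5678/dataset-c",
    relation_type_id: "references",
    timestamp: "2024-02-01T00:00:00Z",
  },
];

const deduped = dedupeEvents(sampleEvents);
console.log(deduped.length);
```

This only removes duplicates within a single batch. Duplicates against rows already in the assertions table would still need a separate check at insert time, for example a lookup on the same three fields.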