datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License
2 stars 0 forks source link

Remove Duplicate Assertions #16

Closed ashwinisukale closed 3 weeks ago

ashwinisukale commented 1 month ago

Description: The assertions table currently contains many duplicate entries, overstating the number of citations. This task aims to remove duplicate rows, keeping only the most recent entry based on the updated field.

Tasks:

  1. Identify duplicate rows based on the combination of obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id.
  2. Keep the most recent entry for each set of duplicates and remove the others.

Query:


WITH ranked_assertions AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, source_id ORDER BY updated DESC) AS rn
    FROM assertions
    where source_id = 'c66aafc0-cfd6-4bce-9235-661a4a7c6126'
)
delete obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number FROM assertions
WHERE id IN (
    SELECT id
    FROM ranked_assertions
    WHERE rn > 1
);