datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

Validation and Documentation of Data Removal #17

Closed: ashwinisukale closed this 2 weeks ago

ashwinisukale commented 1 month ago

Description: After executing the data removal tasks, we need to validate the changes and document the process to ensure data integrity and correctness.

Tasks:

  1. Re-run the initial data analysis queries to compare the data before and after removal.
  2. Validate that the desired rows were correctly removed.
  3. Document the validation process, including the before and after counts, and any anomalies or issues encountered.
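The before/after comparison described in the tasks above could be scripted roughly as follows. This is only a sketch under stated assumptions: it uses an in-memory SQLite table as a stand-in for the real `assertions` table, and the names `CHECKS`, `snapshot`, and `compare` are hypothetical helpers, not part of this repository.

```python
import sqlite3

# Hypothetical labels mapped to count queries. Only this first query comes
# from the issue; other validation queries could be added the same way.
CHECKS = {
    "total_assertions": "SELECT COUNT(*) FROM assertions",
}


def snapshot(conn, checks):
    """Run each count query and return {label: count}."""
    return {label: conn.execute(sql).fetchone()[0] for label, sql in checks.items()}


def compare(before, after):
    """Return (label, before, after, removed) rows for the write-up."""
    return [(label, before[label], after[label], before[label] - after[label])
            for label in before]


# Toy in-memory table standing in for the real assertions table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assertions (id INTEGER PRIMARY KEY, relation_type_id TEXT)")
conn.executemany(
    "INSERT INTO assertions (relation_type_id) VALUES (?)",
    [("cites",), ("is-part-of",), ("references",)],
)

before = snapshot(conn, CHECKS)
# Stand-in for the real removal task.
conn.execute("DELETE FROM assertions WHERE relation_type_id = 'is-part-of'")
after = snapshot(conn, CHECKS)

report = compare(before, after)
for label, n_before, n_after, removed in report:
    print(f"{label}: before={n_before} after={n_after} removed={removed}")
```

Capturing the "before" snapshot ahead of the removal matters here, since the counts cannot be recovered afterwards without a backup.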

Validation Queries:


-- Re-run these queries to validate changes

-- Total count of rows in assertions table
SELECT COUNT(*) FROM assertions;

-- Verify removal of non-citation relationship types
-- (should return 0 if the removal succeeded)
SELECT COUNT(*)
FROM assertions
WHERE source_id = '3644e65a-1696-4cdf-9868-64e7539598d2'
AND relation_type_id NOT IN (
    'cites', 'is-cited-by', 'references',
    'is-referenced-by', 'is-supplemented-by', 'is-supplement-to'
);

-- Verify removal of clinical trial registries
-- (should return 0 if the removal succeeded)
SELECT COUNT(*)
FROM assertions
WHERE repository_id IN (
    'fef75a3c-6e48-4170-be9d-415601efb689',
    '2638e611-ff6f-49db-9b3e-702ecd16176b'
);

-- Verify removal of duplicate assertions
-- (should return 0 if no duplicate sets remain)
SELECT COUNT(*)
FROM (
    SELECT obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, source_id
    FROM assertions
    GROUP BY obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, source_id
    HAVING COUNT(*) > 1
) AS duplicate_sets;

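To make anomalies easy to spot when documenting the run, the three verification queries above could be wrapped in a small check that reports any query still matching rows. This sketch assumes that a successful removal leaves each of them at zero. The in-memory SQLite table, its sample rows, and the names `ZERO_CHECKS` and `find_anomalies` are illustrative assumptions, and the real schema may differ.

```python
import sqlite3

SOURCE_ID = "3644e65a-1696-4cdf-9868-64e7539598d2"

# The three verification queries from this issue, keyed by hypothetical labels.
ZERO_CHECKS = {
    "non_citation_relation_types": """
        SELECT COUNT(*) FROM assertions
        WHERE source_id = '3644e65a-1696-4cdf-9868-64e7539598d2'
        AND relation_type_id NOT IN (
            'cites', 'is-cited-by', 'references',
            'is-referenced-by', 'is-supplemented-by', 'is-supplement-to'
        )""",
    "clinical_trial_registries": """
        SELECT COUNT(*) FROM assertions
        WHERE repository_id IN (
            'fef75a3c-6e48-4170-be9d-415601efb689',
            '2638e611-ff6f-49db-9b3e-702ecd16176b'
        )""",
    "duplicate_sets": """
        SELECT COUNT(*) FROM (
            SELECT obj_id, subj_id, repository_id, publisher_id,
                   journal_id, accession_number, source_id
            FROM assertions
            GROUP BY obj_id, subj_id, repository_id, publisher_id,
                     journal_id, accession_number, source_id
            HAVING COUNT(*) > 1
        ) AS duplicate_sets""",
}


def find_anomalies(conn):
    """Return {label: count} for every check that still matches rows."""
    results = {label: conn.execute(sql).fetchone()[0]
               for label, sql in ZERO_CHECKS.items()}
    return {label: n for label, n in results.items() if n != 0}


# Toy table with only the columns the queries touch (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE assertions (
    obj_id TEXT, subj_id TEXT, repository_id TEXT, publisher_id TEXT,
    journal_id TEXT, accession_number TEXT, source_id TEXT,
    relation_type_id TEXT)""")
rows = [
    ("obj-1", "subj-1", "repo-a", "pub-a", "jrnl-a", "ACC1", SOURCE_ID, "cites"),
    ("obj-2", "subj-2", "repo-a", "pub-a", "jrnl-a", "ACC2", SOURCE_ID, "references"),
    # Deliberate leftover duplicate of the first row, to show a flagged anomaly.
    ("obj-1", "subj-1", "repo-a", "pub-a", "jrnl-a", "ACC1", SOURCE_ID, "cites"),
]
conn.executemany("INSERT INTO assertions VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

anomalies = find_anomalies(conn)
print(anomalies)  # {'duplicate_sets': 1}
```

One caveat worth recording in the write-up: in SQL, `NOT IN` never matches a row whose `relation_type_id` is NULL, so such rows would not be counted by the first check and may need a separate `IS NULL` query.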
ashwinisukale commented 2 weeks ago

Closing this as we have already completed it.