datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License
2 stars 0 forks source link

Investigate Invalid Accession Numbers for Top 20 Repositories #18

Closed ashwinisukale closed 3 months ago

ashwinisukale commented 4 months ago

Description: As part of the Data Citation Corpus data quality improvements, we need to investigate whether the accession numbers captured in CZI data are valid for the repositories they are attributed to. This task focuses on generating a report to identify invalid accession numbers for the top 20 repositories by assertion count.

Tasks:

  1. Generate a report in CSV format for each of the top 20 repositories (by assertion count) containing:
    • All fields from all rows in the assertions table that reference the repository in repository_id.
    • An additional field accession-number-pattern-match indicating whether the value in accession_number matches the published identifier pattern for the repository (True/False).

Query:

-- Example query to generate the report for a single repository
SELECT *,
       CASE
           WHEN accession_number ~ 'INSERT_PATTERN_HERE' THEN 'True'
           ELSE 'False'
       END AS accession_number_pattern_match
FROM assertions
WHERE repository_id = 'REPOSITORY_UUID_HERE'
AND source_id = 'c66aafc0-cfd6-4bce-9235-661a4a7c6126';

Replace INSERT_PATTERN_HERE with the actual regex pattern for each repository from the provided list.

Validation:

Verify that the report includes all necessary fields and the accession-number-pattern-match field is correctly populated. Cross-check a sample of the accession numbers against the published patterns to ensure accuracy.

Notes:

This task should be performed after the completion of data removal tasks. Use the database copy created for data removal to perform this investigation. Replace UUIDs from the assertions table with string values where convenient (repository_id, publisher_id, subject_id, source_id).

Doc: https://docs.google.com/document/d/1ffjsILvBKSOcuzrEQyh4BNF82ijOrcj4UkhPKwtpGnQ/edit