datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

Remove Rows with Non-Citation Relationship Types #14

Closed: ashwinisukale closed this issue 3 months ago

ashwinisukale commented 4 months ago

As part of the Data Citation Corpus data quality improvements, we need to remove rows from the assertions table that have non-citation relationship types. The goal is to clean the database of assertions that do not indicate a citation.

Tasks:

  1. Identify rows with source_id 3644e65a-1696-4cdf-9868-64e7539598d2 (DataCite) and a relation_type_id not in the following list:
    • cites
    • is-cited-by
    • references
    • is-referenced-by
    • is-supplemented-by
    • is-supplement-to
  2. Execute the query to remove these rows.

Query:

DELETE FROM assertions
WHERE source_id = '3644e65a-1696-4cdf-9868-64e7539598d2'
AND relation_type_id NOT IN (
    'cites', 'is-cited-by', 'references', 
    'is-referenced-by', 'is-supplemented-by', 'is-supplement-to'
);

Validation: Before executing the deletion, count the rows that match the criteria (same source_id and NOT IN filter as the DELETE). After execution, re-run the same count and confirm that it returns zero.
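The count-delete-verify sequence could be scripted roughly as follows. This is only a sketch, not the procedure that was actually run: it uses Python's built-in sqlite3 module with an in-memory stand-in for the assertions table, and the helper names (count_matching, remove_non_citation_rows, build_demo_db), the three-column table layout, and the sample rows are all hypothetical. Against the real database the driver and placeholder syntax would differ.

```python
import sqlite3

DATACITE_SOURCE_ID = "3644e65a-1696-4cdf-9868-64e7539598d2"
CITATION_TYPES = (
    "cites", "is-cited-by", "references",
    "is-referenced-by", "is-supplemented-by", "is-supplement-to",
)

# One shared WHERE clause, so the count and the delete target the same rows.
PLACEHOLDERS = ", ".join("?" for _ in CITATION_TYPES)
WHERE = f"source_id = ? AND relation_type_id NOT IN ({PLACEHOLDERS})"
PARAMS = (DATACITE_SOURCE_ID, *CITATION_TYPES)


def count_matching(conn):
    """Count DataCite rows whose relation type is not a citation type."""
    sql = f"SELECT COUNT(*) FROM assertions WHERE {WHERE}"
    return conn.execute(sql, PARAMS).fetchone()[0]


def remove_non_citation_rows(conn):
    """Count, delete, and re-count; returns (before, deleted, after)."""
    before = count_matching(conn)
    with conn:  # commits on success, rolls back if an exception is raised
        deleted = conn.execute(
            f"DELETE FROM assertions WHERE {WHERE}", PARAMS
        ).rowcount
        if deleted != before:
            raise RuntimeError(f"expected to delete {before}, deleted {deleted}")
    after = count_matching(conn)
    return before, deleted, after


def build_demo_db():
    """Hypothetical stand-in table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assertions (id INTEGER PRIMARY KEY, "
        "source_id TEXT, relation_type_id TEXT)"
    )
    rows = [
        (DATACITE_SOURCE_ID, "cites"),             # kept: citation type
        (DATACITE_SOURCE_ID, "is-part-of"),        # removed
        (DATACITE_SOURCE_ID, "has-version"),       # removed
        (DATACITE_SOURCE_ID, "is-supplement-to"),  # kept: citation type
        ("00000000-0000-0000-0000-000000000000", "is-part-of"),  # kept: other source
    ]
    conn.executemany(
        "INSERT INTO assertions (source_id, relation_type_id) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    before, deleted, after = remove_non_citation_rows(build_demo_db())
    print(f"matched before: {before}, deleted: {deleted}, matched after: {after}")
```

Running the delete inside a transaction means a mismatch between the expected and actual row counts rolls the change back instead of committing it. One caveat of the NOT IN filter: rows whose relation_type_id is NULL match neither the count nor the DELETE, so if such rows exist they would need to be handled separately.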