Make-Data-Count-Community / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

Remove assertions where subj_id=obj_id #28

Closed by lizkrznarich 4 months ago

lizkrznarich commented 4 months ago

Approximately 80,000 assertions have the same DOI value in both subj_id and obj_id. An assertion in which a DOI cites itself is not a valid citation for the purposes of the corpus.

  1. Generate a report of assertions where subj_id=obj_id, including columns from the assertions table only.
  2. Permanently remove assertions where subj_id=obj_id from the dev database, where the other data cleanup has already been performed. Also remove any corresponding rows from assertions_affiliations, assertions_funders and assertions_subjects.

Note: All assertions where subj_id=obj_id are from CZI, so there is no need to limit the query by source.
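
The two steps above could be sketched roughly as follows. This is a minimal illustration, not the script actually used for the corpus: it runs against an in-memory SQLite database as a stand-in for the dev database, and the `id` primary key, the `assertion_id` foreign key, the other child-table columns, and the helper names (`make_demo_db`, `report_self_citations`, `remove_self_citations`) are all assumptions. Only the table names and the `subj_id`/`obj_id` columns come from the issue.

```python
import csv
import io
import sqlite3

# Dependent tables named in the issue; the assertion_id column is an assumption.
CHILD_TABLES = ("assertions_affiliations", "assertions_funders", "assertions_subjects")


def make_demo_db():
    """Build a tiny in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE assertions (id INTEGER PRIMARY KEY, subj_id TEXT, obj_id TEXT);
        CREATE TABLE assertions_affiliations (assertion_id INTEGER, affiliation TEXT);
        CREATE TABLE assertions_funders (assertion_id INTEGER, funder TEXT);
        CREATE TABLE assertions_subjects (assertion_id INTEGER, subject TEXT);
        INSERT INTO assertions VALUES
            (1, '10.1234/a', '10.1234/a'),
            (2, '10.1234/b', '10.5678/c');
        INSERT INTO assertions_funders VALUES (1, 'Funder X'), (2, 'Funder Y');
        INSERT INTO assertions_subjects VALUES (1, 'Biology');
    """)
    return conn


def report_self_citations(conn):
    """Step 1: return CSV text of assertions rows where subj_id = obj_id."""
    cur = conn.execute("SELECT * FROM assertions WHERE subj_id = obj_id ORDER BY id")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header from assertions only
    writer.writerows(cur)
    return out.getvalue()


def remove_self_citations(conn):
    """Step 2: delete matching assertions and their dependent rows atomically."""
    with conn:  # commits on success, rolls back if any statement fails
        # Delete child rows first, while the parent rows can still be matched.
        for table in CHILD_TABLES:
            conn.execute(
                f"DELETE FROM {table} WHERE assertion_id IN "
                "(SELECT id FROM assertions WHERE subj_id = obj_id)"
            )
        cur = conn.execute("DELETE FROM assertions WHERE subj_id = obj_id")
    return cur.rowcount  # number of assertions removed


if __name__ == "__main__":
    conn = make_demo_db()
    print(report_self_citations(conn))
    print("removed:", remove_self_citations(conn))
```

In this sketch the report is generated before the delete, since the removal is permanent, and the child-table deletes run inside the same transaction so that a failure partway through leaves no orphaned rows.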