datacite / corpus-data-file

Code and steps used to generate the Data Citation Corpus dump file
MIT License

v2 release: Remove assertions with invalid accession numbers #43

Open lizkrznarich opened 1 month ago

lizkrznarich commented 1 month ago

Following on from the work in https://github.com/datacite/corpus-data-file/issues/18 and https://github.com/datacite/corpus-data-file/pull/37 to validate accession numbers and generate reports with the validation results, remove the assertions whose accession numbers the product team has identified as invalid per the spec (https://docs.google.com/document/d/1SaIU_HUFIMQN-PrYjZ0JJXuYBydWcYOLaQs6dJ1uW9c/edit).

Files with lists of assertions to remove are located in https://drive.google.com/drive/u/3/folders/19T2ICvsDUYk4VA9dbDagsWp5VGIAo7Vd

kaysiz commented 3 weeks ago

The CSVs list 502612 assertion ids to remove. However, 28820 of those ids were already removed in previous cleanup processes, so we ended up with 473792 records to remove from the assertions table.
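For anyone re-checking these counts, here is a minimal sketch of the reconciliation: split the ids requested in the CSVs into those still present in the table and those already gone. It is not the script that was actually run. The column name `id`, the helper names, and the in-memory sample data are all assumptions for illustration only.

```python
import csv
import io


def read_assertion_ids(csv_file, column="id"):
    """Collect the unique, non-empty ids from one CSV file object.

    The column name "id" is an assumption; adjust it to the real header.
    """
    reader = csv.DictReader(csv_file)
    return {
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    }


def plan_removal(requested_ids, existing_ids):
    """Split requested ids into those still present and those already removed."""
    to_remove = requested_ids & existing_ids
    already_removed = requested_ids - existing_ids
    return to_remove, already_removed


# Hypothetical sample data standing in for the real CSVs and assertions table.
sample_csv = io.StringIO("id\na1\na2\na3\na2\n")
requested = read_assertion_ids(sample_csv)
existing = {"a1", "a3", "a9"}

to_remove, already_removed = plan_removal(requested, existing)
print(len(requested), len(already_removed), len(to_remove))
```

The same relationship holds for the reported figures: 502612 requested minus 28820 already removed leaves 473792 to delete.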

kaysiz commented 3 weeks ago

Done. Snapshot created: delete-invalid-assertions-5256114