SSHOC / marketplace-curation

Project to manage scripts and auxiliary data, via Python library and Jupyter notebooks, for the curation of the SSH Open Marketplace

process duplicates identified by notebook 4.1 #27

Open laureD19 opened 3 months ago

laureD19 commented 3 months ago

Duplicates are mainly due to re-ingest errors of the CRF.

notebook 4.1 currently gives back:

@carikan @mkrzmr - the suggestion would be to go through notebook 4.1 together and decide how we can share the work for the merges needed

mkrzmr commented 3 months ago

dup_label.csv dup_url.csv

Ran the same and uploaded the files. Suggest we divide the work and merge the items. Q: What to do with item sources? Might create issues with ingest.
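One possible way to divide the work, as a minimal sketch only: group the rows that share the same value, then deal the groups out round-robin so each curator merges whole duplicate sets. The function name `assign_duplicate_groups`, the `label` and `persistentId` column names, and the sample rows are all hypothetical; the actual columns of dup_label.csv and dup_url.csv are not shown in this thread.

```python
from collections import defaultdict
from itertools import cycle


def assign_duplicate_groups(rows, key, curators):
    """Group rows sharing the same `key` value and deal the groups out round-robin."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    assignment = {name: [] for name in curators}
    # Sort by the shared value so the split is reproducible between runs.
    ordered = sorted(groups.items(), key=lambda pair: pair[0])
    for name, (value, members) in zip(cycle(curators), ordered):
        assignment[name].append((value, members))
    return assignment


# Hypothetical rows standing in for the contents of dup_label.csv.
rows = [
    {"label": "Tool A", "persistentId": "id1"},
    {"label": "Tool A", "persistentId": "id2"},
    {"label": "Tool B", "persistentId": "id3"},
    {"label": "Tool B", "persistentId": "id4"},
    {"label": "Tool C", "persistentId": "id5"},
    {"label": "Tool C", "persistentId": "id6"},
    {"label": "Tool D", "persistentId": "id7"},
    {"label": "Tool D", "persistentId": "id8"},
]
curators = ["laureD19", "carikan", "mkrzmr"]

assignment = assign_duplicate_groups(rows, "label", curators)
for name, groups in assignment.items():
    print(name, [value for value, _ in groups])
```

With the real files, the rows could be read with `csv.DictReader` and `key` set to whichever column identifies a duplicate set. Keeping each set with a single person avoids two people merging the same items.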

laureD19 commented 3 months ago

If I'm not wrong, the new item created during a merge doesn't have an item source, but keeps in its history the two (or more) merged items (including their sources). With DACE, in case of a re-ingest, the ingest pipeline should, in theory, notice the difference between the first ingest and the current one and mark the problematic item(s) for moderators to look at before they are approved.