FINNGEN / kanta_lab_preprocessing

Repo for kanta lab QC
MIT License

Add deduplication based on OID-triplet #17

Closed vincent-octo closed 1 month ago

vincent-octo commented 2 months ago

One thing Kira meant to do but didn't get to: identify duplicate records based on a specific set of OIDs.

I believe the following columns should be looked at together to identify additional duplicate rows: asiakirjaoid, merkintaoid, entryoid.

Probably best to discuss it with Kira to get more clarity.
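For illustration, here is a minimal sketch of what deduplicating on the OID triplet could look like. It assumes a tab-separated input with a header row and that the first occurrence of each triplet is the one to keep; neither is confirmed in this thread. The helper name `drop_oid_duplicates`, the `value` column and the sample rows are made up for the example.

```python
import csv
import io

# Key columns taken from the comment above; everything else is hypothetical.
OID_KEYS = ("asiakirjaoid", "merkintaoid", "entryoid")


def drop_oid_duplicates(rows, keys=OID_KEYS):
    """Yield rows whose OID triplet has not been seen yet (first one wins)."""
    seen = set()
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in seen:
            continue  # same triplet as an earlier row: treat as duplicate
        seen.add(key)
        yield row


# Toy tab-separated data, not real Kanta records.
SAMPLE = """asiakirjaoid\tmerkintaoid\tentryoid\tvalue
1.2.3\t4.5.6\t7.8.9\t5.1
1.2.3\t4.5.6\t7.8.9\t5.1
1.2.3\t4.5.6\t7.8.10\t6.0
"""

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
unique_rows = list(drop_oid_duplicates(reader))
```

This keeps every seen triplet in memory, so it only suits inputs where the set of keys fits in RAM.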

piotor87 commented 2 months ago

Yep. Gotta make a WDL where I split the files, sort the chunks, and join them back; then I can easily remove duplicate entries. The munging should preserve the order, so we can also check for duplicates after munging if needed.
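The split / sort / join / dedup idea can be sketched roughly as below. This is not the actual WDL, just an in-memory Python illustration of the same steps: the function names, the tuple row layout and the toy data are all assumptions, and a real pipeline would write each sorted chunk to disk instead of holding it in a list.

```python
import heapq
from itertools import islice


def sorted_chunks(rows, chunk_size, key):
    """Split rows into fixed-size chunks and sort each chunk by key."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        chunk.sort(key=key)
        yield chunk


def dedup_sorted(rows, key):
    """Drop rows whose key equals the previous row's key (input must be sorted)."""
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        current = key(row)
        if current != previous:
            yield row
        previous = current


def split_sort_merge_dedup(rows, chunk_size, key):
    """Sort chunk by chunk, merge the sorted chunks, then remove adjacent duplicates."""
    merged = heapq.merge(*sorted_chunks(rows, chunk_size, key), key=key)
    return list(dedup_sorted(merged, key))


def oid_key(row):
    # Hypothetical layout: (asiakirjaoid, merkintaoid, entryoid, value)
    return row[:3]


rows = [
    ("b", "1", "x", "v1"),
    ("a", "1", "x", "v2"),
    ("b", "1", "x", "v1"),
    ("a", "2", "y", "v3"),
    ("a", "1", "x", "v2"),
]
deduped = split_sort_merge_dedup(rows, chunk_size=2, key=oid_key)
```

Once the merged stream is sorted, duplicates sit next to each other, so a single pass comparing each key with the previous one is enough. Note that the output of this sketch is in key order, not the original input order.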

piotor87 commented 1 month ago

Closing as it's now addressed with a WDL.