FINNGEN / kanta_lab_preprocessing

Repo for kanta lab QC

Duplicates and filtering #10

Closed: piotor87 closed this issue 1 month ago

piotor87 commented 5 months ago

So far we've been very conservative with filtering. ATM we filter out:

Kira is more aggressive and removes lines based on NA entries in a combination of fields:

    // Only saving if we have either the value or at least the abnormality and a lab id
    if ((!((lab_value == "NA") & (lab_abnormality == "NA"))) & (lab_id != "NA")) {
        // Increasing line count for duplicate lines in this file to one (meaning that this line is not actually duplicated)
        all_dup_lines[dup_line] = 1;
    }

Should we implement it in the same way?
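
For comparison, here is a minimal sketch of the same keep/drop rule in Python. The column names (lab_value, lab_abnormality, lab_id) and the tab-separated input are assumptions taken from the snippet above, not from our actual munged output.

    import csv
    import sys

    # Keep a row only if it has a lab id and either a value or an abnormality flag.
    # Column names are assumed from the quoted snippet and may differ in our data.
    def keep_row(row):
        has_measurement = row["lab_value"] != "NA" or row["lab_abnormality"] != "NA"
        return has_measurement and row["lab_id"] != "NA"

    with open(sys.argv[1], newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if keep_row(row):
                writer.writerow(row)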

Also, what should be the minimum set of keys that defines a duplicate entry? We should probably look at the munged output and see which groups of columns produce the most duplicate rows. The issue, however, is that the mock data was not built with this purpose in mind, so we might not be able to extrapolate that much.
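
To get a feel for which key sets matter, something like the pandas sketch below could be run on the munged output. The file name and column names here are placeholders, not the real ones.

    import pandas as pd

    # For each candidate key set, count how many rows would be flagged as duplicates.
    candidate_keys = [
        ["id", "time", "abbreviation"],
        ["id", "time", "abbreviation", "value"],
        ["id", "time", "abbreviation", "value", "unit"],
    ]

    df = pd.read_csv("munged_output.tsv", sep="\t", dtype=str)
    for keys in candidate_keys:
        n_dup = df.duplicated(subset=keys, keep="first").sum()
        print(f"{','.join(keys)}: {n_dup} duplicate rows")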

piotor87 commented 4 months ago

Current status: I've added a WDL that handles column subsetting and duplicate removal. Column names are automatically fetched from magic_config.py (dev or main, based on the test boolean).

I've also added a sort_columns entry to the config so that it can also be fetched automatically. ATM I use id, time and abbreviation to define duplicates; I'll add value and unit later.
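
As a rough sketch of what that dedup step amounts to (in the actual WDL the key columns would come from magic_config.py; the file names and the pandas implementation here are only illustrative):

    import pandas as pd

    # Key columns hard-coded here for clarity; in the pipeline they are fetched
    # from the config. Value and unit would be appended to this list later.
    dedup_keys = ["id", "time", "abbreviation"]

    df = pd.read_csv("subset_output.tsv", sep="\t", dtype=str)
    df = df.drop_duplicates(subset=dedup_keys, keep="first")
    df.to_csv("dedup_output.tsv", sep="\t", index=False)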