FINNGEN / kanta_lab_preprocessing

Repo for kanta lab QC

Duplicates and filtering #10

Closed: piotor87 closed this issue 1 month ago

piotor87 commented 5 months ago

So far we've been very conservative with filtering. ATM we filter out:

Kira is more aggressive and removes lines based on NA entries in a combination of fields:

    // Only saving if we have either the value or at least the abnormality and a lab id
    if ((!((lab_value == "NA") & (lab_abnormality == "NA"))) & (lab_id != "NA")) {
        // Increasing line count for duplicate lines in this file to one (meaning that this line is not actually duplicated)
        all_dup_lines[dup_line] = 1;
    }

Should we implement it in the same way?
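
For comparison, here is a minimal sketch of the same keep/drop rule in Python. The column names (lab_value, lab_abnormality, lab_id) and the tab-separated input are assumptions taken from the snippet above, not from our actual munged output.

    import csv
    import sys

    # Keep a row only if it has a lab id and either a value or an abnormality flag.
    # Column names are assumed from the quoted snippet and may differ in our data.
    def keep_row(row):
        has_measurement = row["lab_value"] != "NA" or row["lab_abnormality"] != "NA"
        return has_measurement and row["lab_id"] != "NA"

    with open(sys.argv[1], newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t")
        writer = csv.DictWriter(sys.stdout, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if keep_row(row):
                writer.writerow(row)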

Also, what should be the minimum set of keys that defines a duplicate entry? We should probably look at the munged output and see which groups of columns produce the most duplicate rows. The issue, however, is that the mock data was not built with this purpose in mind, so we might not be able to extrapolate that much.
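
To get a feel for which key sets matter, something like the pandas sketch below could be run on the munged output. The file name and column names here are placeholders, not the real ones.

    import pandas as pd

    # For each candidate key set, count how many rows would be flagged as duplicates.
    candidate_keys = [
        ["id", "time", "abbreviation"],
        ["id", "time", "abbreviation", "value"],
        ["id", "time", "abbreviation", "value", "unit"],
    ]

    df = pd.read_csv("munged_output.tsv", sep="\t", dtype=str)
    for keys in candidate_keys:
        n_dup = df.duplicated(subset=keys, keep="first").sum()
        print(f"{','.join(keys)}: {n_dup} duplicate rows")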

piotor87 commented 4 months ago

Current status: I've added a WDL that handles column subsetting and duplicate removal. Column names are automatically fetched from magic_config.py (dev or main, based on the test boolean).

I've also added a sort_columns entry to the config so that it can also be fetched automatically. ATM I use id, time and abbreviation to define duplicates; I'll add value and unit later.
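
As a rough sketch of what that dedup step amounts to (in the actual WDL the key columns would come from magic_config.py; the file names and the pandas implementation here are only illustrative):

    import pandas as pd

    # Key columns hard-coded here for clarity; in the pipeline they are fetched
    # from the config. Value and unit would be appended to this list later.
    dedup_keys = ["id", "time", "abbreviation"]

    df = pd.read_csv("subset_output.tsv", sep="\t", dtype=str)
    df = df.drop_duplicates(subset=dedup_keys, keep="first")
    df.to_csv("dedup_output.tsv", sep="\t", index=False)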