Closed — piotor87 closed this 1 month ago
Current status:
I've added a WDL that handles column subsetting and duplicate removal.
Column names are fetched automatically from magic_config.py
(dev or main, depending on the test boolean).
I've also added a sort_columns
entry to the config, so that can be fetched automatically as well. At the moment I use id, time, and abbreviation to deduplicate; I'll also add value and unit later.
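As a minimal sketch of the dedup step described above (column names are assumptions standing in for the values the task actually fetches from magic_config.py):

```python
# Hypothetical sketch: deduplicate on the current key set (id, time, abbreviation).
# The real column list comes from magic_config.py; these names are assumptions.
DEDUP_KEYS = ("id", "time", "abbreviation")

def dedup(rows):
    """Drop rows whose DEDUP_KEYS tuple was already seen, keeping the first."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[k] for k in DEDUP_KEYS)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"id": 1, "time": "t0", "abbreviation": "hr", "value": 60},
    {"id": 1, "time": "t0", "abbreviation": "hr", "value": 60},  # duplicate
    {"id": 2, "time": "t1", "abbreviation": "bp", "value": 120},
]
```

Adding value and unit later would just mean extending `DEDUP_KEYS`.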
So far we've been very conservative with filtering. At the moment we filter out:
Kira is more aggressive and removes lines based on NA entries in a combination of fields:
Should we implement it in the same way?
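If we did adopt Kira's approach, it could look roughly like this. The actual field combination isn't listed in this thread, so `REQUIRED_FIELDS` below is a hypothetical stand-in:

```python
# Hedged sketch of NA-based row filtering. REQUIRED_FIELDS is hypothetical;
# the real combination of fields would come from Kira's rules / the config.
REQUIRED_FIELDS = ("id", "time", "value")

def drop_na_rows(rows, required=REQUIRED_FIELDS):
    """Remove rows where any required field is missing (None, empty, or 'NA')."""
    return [r for r in rows
            if all(r.get(f) not in (None, "", "NA") for f in required)]

rows = [
    {"id": 1, "time": "t0", "value": 60},
    {"id": 2, "time": None, "value": 120},   # missing time -> dropped
    {"id": 3, "time": "t2", "value": "NA"},  # NA value -> dropped
]
```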
Also, what should be the minimum set of keys that defines a duplicate entry? We should probably look at the munged output and see which groups of columns produce the most duplicate values. The issue, however, is that the mock data was not built with this purpose in mind, so we may not be able to extrapolate much from it.
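The "which column groups produce the most duplicates" check could be sketched like this (column names are again assumptions, not the real schema):

```python
from itertools import combinations

def duplicate_counts(rows, columns, max_keys=3):
    """For each combination of candidate key columns, count how many rows
    would be flagged as duplicates if that combination defined row identity."""
    results = {}
    for r in range(1, max_keys + 1):
        for combo in combinations(columns, r):
            seen = set()
            dupes = 0
            for row in rows:
                key = tuple(row[c] for c in combo)
                if key in seen:
                    dupes += 1
                else:
                    seen.add(key)
            results[combo] = dupes
    return results

# Toy stand-in for the munged output.
rows = [
    {"id": 1, "time": "t0", "abbreviation": "hr"},
    {"id": 1, "time": "t0", "abbreviation": "hr"},
    {"id": 1, "time": "t1", "abbreviation": "hr"},
]
counts = duplicate_counts(rows, ["id", "time", "abbreviation"])
```

Running this over the real munged output would show which key sets are too loose (many collisions) versus strict enough, with the caveat noted above that the mock data may not be representative.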