When we calculate multiple PGS, many of the scoring-file rows are redundant: the same variant often appears in several scores.
| EFO trait  | N scores | N variants | N unique (chr/pos/eff/oth) | Fraction unique |
|------------|----------|------------|----------------------------|-----------------|
| Autoimmune | 60       | 12,949,261 | 8,494,054                  | 0.66            |
| CVD        | 125      | 89,802,593 | 16,654,741                 | 0.19            |
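The fraction-unique figures above can be reproduced from a long-format scoring file by deduplicating on the matching key. A minimal sketch with a toy DataFrame (the column names `PGS`, `chr`, `pos`, `eff`, `oth` are assumptions about the scoring-file layout):

```python
import pandas as pd

# Toy scoring file: one row per (score, variant) pair.
# PGS1 and PGS2 share the variant at chr1:100, so it is redundant.
scores = pd.DataFrame({
    "PGS": ["PGS1", "PGS1", "PGS2", "PGS2"],
    "chr": [1, 1, 1, 2],
    "pos": [100, 200, 100, 300],
    "eff": ["A", "C", "A", "G"],
    "oth": ["G", "T", "G", "A"],
})

# Matching key: [chr, pos, eff, oth]
key = ["chr", "pos", "eff", "oth"]

# Fraction of rows that are unique variants (3 unique / 4 total here).
pct_unique = scores[key].drop_duplicates().shape[0] / len(scores)
```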
Does it speed things up if we:
- Subset to unique variants ([chr, pos, eff, oth]) for matching
- Rejoin the matches with the original scoring file
- Run variant labelling (which requires the scoring-file info)
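The subset-then-rejoin idea can be sketched with pandas merges. This is only an illustration under assumed column names; the real `match_variants` logic (allele flipping, ambiguity handling, etc.) is not shown:

```python
import pandas as pd

# Toy scoring file: two scores sharing one variant.
scoring = pd.DataFrame({
    "PGS": ["PGS1", "PGS2", "PGS2"],
    "chr": [1, 1, 2],
    "pos": [100, 100, 300],
    "eff": ["A", "A", "G"],
    "oth": ["G", "G", "A"],
    "weight": [0.1, 0.2, 0.3],
})

# Toy target genotype variants to match against.
target = pd.DataFrame({
    "chr": [1, 2],
    "pos": [100, 300],
    "eff": ["A", "G"],
    "oth": ["G", "A"],
    "ID": ["rs1", "rs2"],
})

key = ["chr", "pos", "eff", "oth"]

# 1. Match only the deduplicated variants (smaller join).
unique_variants = scoring[key].drop_duplicates()
matched = unique_variants.merge(target, on=key, how="inner")

# 2. Rejoin with the full scoring file to recover per-score rows
#    (weights etc.) for labelling.
rejoined = scoring.merge(matched, on=key, how="inner")
```

The expensive match runs over the ~0.19-0.66 unique fraction of rows, and the cheap rejoin fans the result back out to every score.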
Another way to do this may be to make a wide DF (e.g. each PGS is a column), do the matching, then the labelling, then split back to long format?
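The wide-DF variant of the idea might look like this pivot/melt round trip (again with hypothetical column names; `pivot_table` leaves a NaN where a score does not include a variant, which the final `dropna` removes on the way back to long format):

```python
import pandas as pd

# Toy long-format scoring file.
scoring = pd.DataFrame({
    "PGS": ["PGS1", "PGS2", "PGS2"],
    "chr": [1, 1, 2],
    "pos": [100, 100, 300],
    "eff": ["A", "A", "G"],
    "oth": ["G", "G", "A"],
    "weight": [0.1, 0.2, 0.3],
})
key = ["chr", "pos", "eff", "oth"]

# Pivot: one row per unique variant, one weight column per score.
wide = scoring.pivot_table(index=key, columns="PGS", values="weight").reset_index()

# ... matching and labelling would run once over `wide` here ...

# Split back to long format, dropping variants absent from each score.
long = (wide.melt(id_vars=key, var_name="PGS", value_name="weight")
            .dropna(subset=["weight"]))
```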
Running the matching in series (e.g. when genotyping data is split by chromosome) is slow and makes the pipeline's wall time very long. We should spawn parallel match_variants jobs and then aggregate the logs. [This will save memory and wall time; the implementation will partially live within pgsc_calc]