PGScatalog / pgscatalog_utils

(superseded by pygscatalog) Utilities for working with PGS Catalog API and scoring files
Apache License 2.0

Further match_variants optimisations #27

Closed smlmbrt closed 1 year ago

smlmbrt commented 1 year ago
  1. When we're calculating multiple PGS, many of the variant rows are redundant across scores:
| EFO | N scores | N_variants | N_unique (chr/pos/eff/oth) | Fraction unique |
| --- | --- | --- | --- | --- |
| Autoimmune | 60 | 12949261 | 8494054 | 0.66 |
| CVD | 125 | 89802593 | 16654741 | 0.19 |

Does it speed it up if we:

Another way to do this may be to make a wide DF (e.g. each PGS is a column), do the matching and labelling on that, then split it back out by score?
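To make the idea concrete, here's a rough stdlib-only sketch of "match each unique variant once, then fan the result back out to every score that uses it". It isn't how match_variants works today: the row layout, the `target_variants` lookup and the function name `match_unique_then_split` are all made up for illustration, and real matching also has to deal with things like allele flips and ambiguous variants, which are ignored here.

```python
from collections import defaultdict

# Hypothetical scoring-file rows: (pgs_id, chrom, pos, effect_allele, other_allele)
scoring_rows = [
    ("PGS_A", "1", 1000, "A", "G"),
    ("PGS_B", "1", 1000, "A", "G"),
    ("PGS_B", "2", 2000, "C", "T"),
    ("PGS_C", "2", 2000, "C", "T"),
    ("PGS_C", "3", 3000, "G", "A"),
]

# Hypothetical target genome lookup: variant key -> variant ID
target_variants = {
    ("1", 1000, "A", "G"): "1:1000:A:G",
    ("2", 2000, "C", "T"): "2:2000:C:T",
}


def match_unique_then_split(rows, target):
    """Match each unique (chr, pos, eff, oth) key once, then split by score."""
    # 1. Collapse redundant rows: remember which scores use each unique key.
    scores_by_variant = defaultdict(list)
    for pgs_id, *key in rows:
        scores_by_variant[tuple(key)].append(pgs_id)

    # 2. Do the (expensive) matching once per unique key.
    matched = {key: target[key] for key in scores_by_variant if key in target}

    # 3. Label and split: copy each match back to every score that needs it.
    per_score = defaultdict(list)
    for key, variant_id in matched.items():
        for pgs_id in scores_by_variant[key]:
            per_score[pgs_id].append(variant_id)
    return dict(per_score), len(scores_by_variant)


per_score, n_unique = match_unique_then_split(scoring_rows, target_variants)
print(f"{len(scoring_rows)} rows, {n_unique} unique variants")
print(per_score)
```

With the CVD numbers above, step 2 would touch roughly 16.7M keys instead of 89.8M rows, assuming the match cost scales with the number of rows.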

  2. Running the matching in series (e.g. when genotyping data is split by chromosome) is slow and makes the wall time of the pipeline very long. We should spawn parallel match_variants processes and then aggregate the logs. [This will save on memory and wall time; the implementation will partially be within pgsc_calc]
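A toy sketch of the fan-out/aggregate shape, assuming one independent match job per chromosome. `match_chromosome`, `match_in_parallel` and the log row layout are hypothetical, and a thread pool is only a stand-in here: in practice the parallelism would more likely come from separate pgsc_calc processes, with a final step that concatenates the per-chromosome logs.

```python
from concurrent.futures import ThreadPoolExecutor


def match_chromosome(chrom, scoring_keys, target_keys):
    """Hypothetical per-chromosome match: one log row per scoring variant."""
    return [
        {"chrom": chrom, "variant": key, "matched": key in target_keys}
        for key in scoring_keys
    ]


def match_in_parallel(scoring_by_chrom, target_by_chrom, max_workers=4):
    """Run one match job per chromosome, then aggregate the logs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                match_chromosome, chrom, keys, target_by_chrom.get(chrom, set())
            )
            for chrom, keys in scoring_by_chrom.items()
        ]
        # Collect in submission order so the combined log is deterministic.
        return [row for future in futures for row in future.result()]


# Hypothetical inputs, already split by chromosome: (pos, effect, other) keys
scoring_by_chrom = {
    "1": [(1000, "A", "G"), (1500, "T", "C")],
    "2": [(2000, "C", "T")],
}
target_by_chrom = {
    "1": {(1000, "A", "G")},
    "2": {(2000, "C", "T")},
}

combined_log = match_in_parallel(scoring_by_chrom, target_by_chrom)
n_matched = sum(row["matched"] for row in combined_log)
print(f"{n_matched}/{len(combined_log)} variants matched")
```

Each job only needs the scoring and target variants for its own chromosome, which is where the memory saving would come from.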
nebfield commented 1 year ago