Speed improvements

bschilder commented 3 years ago

Not sure if this is already implemented in some places, but MungeSumstats could really benefit from faster processing. Not always an easy task, but here are some ideas.

1. Comb through the code and search for inefficiencies

[DONE ✅] I can take a stab at this since I have fresh eyes.

2. Parallelize across CPUs

Run steps that don't depend on each other's output in parallel.
Chunk the sumstats and process each chunk on a different core. This is basically what DelayedArray does, so that might be another option.
[DONE ✅] Process each sumstats dataset in parallel (implemented in import_sumstats).
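
To make the chunking idea concrete, here is a minimal sketch in Python (not MungeSumstats's actual R code). The function names, the `A1`/`A2` columns, and the upper-casing step are all hypothetical stand-ins for a real per-row cleaning step. It assumes the step treats each row independently, so the chunks can be processed in any order and re-joined afterwards.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def clean_chunk(chunk):
    """Hypothetical per-chunk step: upper-case the allele columns of each row."""
    return [
        {**row, "A1": row["A1"].upper(), "A2": row["A2"].upper()}
        for row in chunk
    ]


def process_in_chunks(rows, chunk_size, executor_cls=ProcessPoolExecutor, workers=2):
    """Run clean_chunk on each chunk in a worker pool and re-join the results."""
    chunks = split_into_chunks(rows, chunk_size)
    with executor_cls(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the input chunks,
        # so the re-joined rows keep their original order.
        processed = pool.map(clean_chunk, chunks)
        return [row for chunk in processed for row in chunk]
```

Steps that need to see the whole table at once (for example, finding duplicated SNP IDs) would not fit this pattern without an extra merge step. When using the default process pool from a script, call `process_in_chunks` under an `if __name__ == "__main__":` guard.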

3. Parallelize across GPU(s)

À la cuDF, which is part of the RAPIDS suite.

bschilder commented 3 years ago

Both tabular and VCF format sum stats can now be read in using data.table::fread(), which speeds up most processes even when single-threaded.

Also removed all instances where the full data file was read in multiple times just to extract header info. These are replaced with a convenience function, read_header(), which only reads in the first 2 lines.
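
The saving comes from stopping after the first couple of lines instead of parsing the whole file. A minimal Python sketch of that idea follows. It is not the package's read_header() implementation; `read_header_lines` and `column_names` are hypothetical names, and the sketch assumes an uncompressed, tab-separated text file.

```python
from itertools import islice


def read_header_lines(path, n=2):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as handle:
        # islice stops iterating after n lines, so a large file is never fully read.
        return [line.rstrip("\r\n") for line in islice(handle, n)]


def column_names(path, sep="\t"):
    """Return the column names from the first line, or [] for an empty file."""
    lines = read_header_lines(path, n=1)
    return lines[0].split(sep) if lines else []
```

Reading two lines rather than one gives the column names plus one data row, which is enough to inspect the separator and the look of the values.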

bschilder commented 3 years ago

import_sumstats can now run in parallel across multiple Open GWAS IDs when parallel_across_ids=TRUE. Otherwise, multiple cores can be allocated to processing each dataset.
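
The two modes trade off where the cores go. A rough Python sketch of that switch is below; it is not the import_sumstats code. `import_all` and the caller-supplied `import_one(dataset_id, n_cores)` are hypothetical, and threads are used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def import_all(ids, import_one, parallel_across_ids=False, n_cores=1):
    """Import every dataset ID, parallelising either across IDs or within each ID.

    import_one(dataset_id, n_cores) is a caller-supplied, hypothetical function.
    """
    if parallel_across_ids:
        # Up to n_cores datasets run at once; each dataset gets a single core.
        with ThreadPoolExecutor(max_workers=n_cores) as pool:
            results = list(pool.map(lambda i: import_one(i, 1), ids))
    else:
        # Datasets run one after another; each may use all n_cores internally.
        results = [import_one(i, n_cores) for i in ids]
    return dict(zip(ids, results))
```

Parallelising across IDs tends to pay off with many small datasets, while giving all cores to one dataset at a time suits a few large ones.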

bschilder commented 3 years ago

I think this is in a pretty good place now that we've done a lot of optimization. Always room for improvement, but I think it's justified to close this for now.

Al-Murphy / MungeSumstats

Speed improvements #6
