Al-Murphy / MungeSumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics
https://doi.org/doi:10.18129/B9.bioc.MungeSumstats

Speed improvements #6

Closed by bschilder 3 years ago

bschilder commented 3 years ago

Not sure if this is already implemented in some places, but MungeSumstats could really benefit from speed optimization. Not always an easy task, but here are some ideas.

1. Comb through the code and search for inefficiencies

2. Parallelize across CPUs

3. Parallelize across GPU(s), à la cuDF, which is part of the RAPIDS suite.

bschilder commented 3 years ago

Both tabular and VCF format summary statistics can now be read in using data.table::fread(), which speeds up most processes even when single-threaded.
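For anyone following along, here is a minimal sketch of what reading with fread() looks like. This is not the MungeSumstats code itself: the toy file and its column names (SNP, CHR, BP, P) are made up for illustration. `nThread` is fread()'s own argument for controlling the number of threads.

```r
library(data.table)

# Write a tiny made-up summary statistics file (columns are illustrative only).
path <- tempfile(fileext = ".tsv")
toy <- data.table(
  SNP = c("rs1", "rs2", "rs3"),
  CHR = c(1L, 1L, 2L),
  BP  = c(1000L, 2000L, 3000L),
  P   = c(0.5, 0.01, 1e-8)
)
fwrite(toy, path, sep = "\t")

# Single-threaded read.
sumstats_1 <- fread(path, nThread = 1)

# Multi-threaded read, using however many threads data.table is configured for.
sumstats_n <- fread(path, nThread = getDTthreads())
```

Both calls return the same table; only the number of threads used for parsing differs.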

Also removed all instances where the full data file was read in multiple times just to extract header info. These are replaced with the convenience function read_header(), which reads in only the first 2 lines.
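The idea behind reading only the header can be sketched in base R as below. `peek_header()` is a hypothetical helper written for this comment, not the package's read_header() implementation, and the toy file contents are made up.

```r
# Hypothetical helper: return only the first `n` lines of a file,
# without loading the rest of the file into memory.
peek_header <- function(path, n = 2L) {
  readLines(path, n = n)
}

# Made-up tab-separated file: a header row plus two data rows.
path <- tempfile(fileext = ".tsv")
writeLines(
  c("SNP\tCHR\tBP\tP", "rs1\t1\t1000\t0.5", "rs2\t1\t2000\t0.01"),
  path
)

# Only the header and the first data row are read.
first_lines <- peek_header(path)

# Column names can then be recovered from the first line alone.
col_names <- strsplit(first_lines[1], "\t", fixed = TRUE)[[1]]
```

Because readLines() stops after `n` lines, the cost of inspecting the header no longer grows with file size, which matters for large summary statistics files.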

bschilder commented 3 years ago

import_sumstats() can now run in parallel across multiple Open GWAS IDs when parallel_across_ids=TRUE. Otherwise, multiple cores can be allocated to processing each dataset in turn.
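As a rough illustration of the "parallel across IDs" pattern, here is a sketch using the base `parallel` package. It is not how import_sumstats() is implemented internally: `process_one()` is a hypothetical stand-in for the per-dataset work, and the IDs are placeholders.

```r
library(parallel)

# Hypothetical stand-in for the per-dataset work (e.g. download + munge).
process_one <- function(id) {
  paste0("processed_", id)
}

# Placeholder IDs, made up for illustration.
ids <- c("id-a-1", "id-a-2", "id-b-3")

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One task per ID, spread across the available cores.
results <- mclapply(ids, process_one, mc.cores = n_cores)
names(results) <- ids
```

The trade-off is the one described above: either each ID gets its own worker, or the IDs are handled one at a time and the cores go to processing a single dataset.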

bschilder commented 3 years ago

I think this is in a pretty good place now that we've done a lot of optimization. Always room for improvement, but I think it's justified to close this for now.