UW-GAC / GENESIS

GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness
https://bioconductor.org/packages/GENESIS
34 stars 13 forks source link

Parallel #71

Closed smgogarten closed 3 years ago

smgogarten commented 3 years ago

Loops in pcrelate, assocTestSingle, assocTestAggregate, and admixMap have been refactored to use the BiocParallel package. With BiocParallel, the user can define a parallel backend and either register it in the current R session or pass it as an argument to the GENESIS function, and that backend will be used to control parallelization. The default is Multicore, which uses one less than the available number of CPUs on the machine (of those, one is the head job so the actual number of workers is N-2).

We use the bpiterate function which takes two arguments: an iterator that supplies a new chunk of data every time it is called, and a function to run on that data (which can take additional arguments). SeqVarIterator objects are used to construct the iterators, which are defined separately within each method to return genotype data in the appropriate format. The functions called by bpiterate are shared across all methods for a given function (e.g., the .testGenoBlockSingle function is called by both SeqVarIterator and GenotypeIterator methods for assocTestSingle). This reduces the duplication of code compared to previous versions of GENESIS. bpiterate has the option of either returning a list containing all elements, or using a "reduce" function to combine elements. The association functions use list return and combine them with rbindlist as previously. pcrelate previously used a reduce function as implemented in the foreach package, and that has been replaced by the bpiterate call.

bpiterate returns errors as list elements, which is undesirable since all error messages then result in a failure of rbindlist to bind errors with successful output. This was addressed with the .stopOnError function that checks for errors and then prints them along with the data chunk in which they occurred.

Performance testing was done on a dataset with ~45k samples. I compared three scenarios: 1) previous GENESIS code, 2) BiocParallel with serial execution, 3) BiocParallel with multicore execution on a 24-core machine. In summary, the memory usage and run time is very similar between BiocParallel serial and the original code. BiocParallel multicore substantially reduces run time at the expense of additional memory.

assocTestSingle: Scenario Run time (s) Memory (GB)
Original 19369 5.6
BiocParallel serial 20063 5.8
BiocParallel multicore 4211 80
assocTestAggregate: Scenario Run time (s) Memory (GB)
Original 455 2.6
BiocParallel serial 319 2.6
BiocParallel multicore 266 38
pcrelate: Scenario Run time (s) Memory (GB)
Original 37049 28
BiocParallel serial 36619 21
BiocParallel multicore 13279 81
codecov-commenter commented 3 years ago

Codecov Report

Merging #71 (d6cae8d) into master (cf70704) will increase coverage by 2.99%. The diff coverage is 95.02%.

:exclamation: Current head d6cae8d differs from pull request most recent head 8586975. Consider uploading reports for the commit 8586975 to get more accurate results Impacted file tree graph

@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
+ Coverage   81.08%   84.07%   +2.99%     
==========================================
  Files          38       38              
  Lines        2823     2851      +28     
==========================================
+ Hits         2289     2397     +108     
+ Misses        534      454      -80     
Impacted Files Coverage Δ
R/nullModelFastScore.R 84.70% <85.71%> (+83.51%) :arrow_up:
R/pcrelate.R 89.91% <88.88%> (+0.53%) :arrow_up:
R/assocTestSingle.R 93.10% <94.33%> (+2.09%) :arrow_up:
R/admixMap.R 89.62% <96.00%> (+0.96%) :arrow_up:
R/assocTestAggregate.R 96.39% <97.40%> (+0.18%) :arrow_up:
R/utils.R 97.70% <100.00%> (-0.49%) :arrow_down:
R/util_methods.R 88.15% <0.00%> (+0.48%) :arrow_up:
R/testGeno.R 60.12% <0.00%> (+0.63%) :arrow_up:
R/nullModelTestPrep.R 85.36% <0.00%> (+17.07%) :arrow_up:
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update cf70704...8586975. Read the comment docs.

mconomos commented 3 years ago

@smgogarten - as far as I can tell, this all looks good. I left a couple minor comments/questions.

Going back to one of my comments -- there's actually a fair amount of redundant code between assocTestSingle and assocTestAggregate, as well as .testGenoBlockSingle and .testGenoBlockAggregate. I'm not sure if it's worthwhile to try to turn those repeated bits into sub-functions themselves, or if that's a waste of time and effort.

mconomos commented 3 years ago

@smgogarten - as far as I can tell, this all seems good to go!