alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015

Parallelize simulate_experiment()? #32

Open kcha opened 8 years ago

kcha commented 8 years ago

Hi,

Thanks for this useful package! I was wondering if there were any plans to parallelize read simulation?

I noticed that it might be possible to parallelize the outer for loop in sgseq(). I tried replacing the for loop with foreach, using the doMC package as the parallel backend. It was a quick change, and although I haven't tested it extensively, it seems to speed things up significantly when more than one replicate or group is being simulated (see: kcha/polyester@7b6c31e60f6608f1b024d8f4be8833ce02d9e62f).
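To illustrate the general pattern (this is only a minimal sketch, not the code from the linked commit), here is how a serial for loop over replicates can be swapped for foreach with a doMC backend. The function simulate_one() is a hypothetical stand-in for the per-replicate work, and the core count of 2 is arbitrary. doMC relies on forking, so this assumes a Unix-like system.

```r
library(foreach)
library(doMC)  # multicore backend for foreach (fork-based, Unix-alikes only)

# Register the parallel backend; 2 cores is an arbitrary choice for the sketch
registerDoMC(cores = 2)

# Hypothetical stand-in for the work done for one replicate inside the loop
simulate_one <- function(i) {
  set.seed(i)          # seed per replicate so results do not depend on scheduling
  mean(rnorm(1000))
}

# Serial version: one replicate after another
serial_res <- numeric(4)
for (i in 1:4) {
  serial_res[i] <- simulate_one(i)
}

# Parallel version: same loop body, iterations spread across the registered cores
parallel_res <- foreach(i = 1:4, .combine = c) %dopar% {
  simulate_one(i)
}

# Both versions should agree because each replicate sets its own seed
print(all.equal(serial_res, parallel_res))
```

Because each iteration is independent, the only change to the loop body is how its result is returned: foreach collects the value of the block instead of assigning into a preallocated vector.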

Interested in hearing your thoughts!

library(polyester)
library(doMC)

fold_changes = matrix(c(1, 1), nrow = 1)

# Time the same simulation with 1, 4, and 8 cores
for (n_cores in c(1, 4, 8)) {
  elapsed <- system.time(
    simulate_experiment('data/toy.fa',
                        readlen = 100,
                        reads_per_transcript = 10000,
                        fold_changes = fold_changes,
                        num_reps = c(4, 4),
                        outdir = 'simulated_reads/single',
                        distr = "empirical",
                        error_model = "illumina5",
                        paired = FALSE,
                        gzip = TRUE, cores = n_cores)
  )
  print(paste("Cores:", n_cores))
  print(elapsed)
}
[1] "Cores: 1"
   user  system elapsed 
 27.032   0.974  28.075 
[1] "Cores: 4"
   user  system elapsed 
 22.472   0.842   7.969 
[1] "Cores: 8"
   user  system elapsed 
 49.123   2.340   7.094 
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.3 (El Capitan)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] doMC_1.3.4      iterators_1.0.8 foreach_1.4.3   polyester_1.7.1

loaded via a namespace (and not attached):
 [1] compiler_3.2.3      zlibbioc_1.14.0     limma_3.24.15      
 [4] IRanges_2.2.9       tools_3.2.3         XVector_0.8.0      
 [7] logspline_2.1.9     Biostrings_2.36.4   codetools_0.2-14   
[10] S4Vectors_0.6.6     BiocGenerics_0.14.0 stats4_3.2.3
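
As a rough way to read the timings above, the elapsed times can be turned into speedup and parallel efficiency figures. This is just arithmetic on the three numbers already reported (single runs, so treat it as indicative only); it works out to roughly 3.5x at 4 cores and just under 4x at 8 cores.

```r
# Elapsed wall-clock seconds copied from the system.time() output above
elapsed <- c("1" = 28.075, "4" = 7.969, "8" = 7.094)
cores <- as.numeric(names(elapsed))

# Speedup relative to the single-core run, and efficiency per core
speedup <- elapsed[["1"]] / elapsed
efficiency <- speedup / cores

print(round(data.frame(cores, elapsed, speedup, efficiency), 2))
```

Note that user time goes up at 8 cores (49.123 s versus 27.032 s on one core) while elapsed time barely improves over the 4-core run, which is why the efficiency figure drops by about half.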