alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015

Shuffle Option, Add Error Info, & Speedups Part 2 #57

Closed warrenmcg closed 3 years ago

warrenmcg commented 6 years ago

Hi @alyssafrazee and @JMF47,

Motivated by my own need to streamline the generation of multiple simulated experiments, I set out to do two things:

  1. create a 'shuffle' option to meet the requirements for downstream quantification tools
  2. find ways to improve the speed

Shuffle Option

The first goal is implemented in write_reads.R (see b3fc42386f146dd1ee1236416910edfd01300cda). I simply use sample to shuffle the order of the reads within each 1M-read chunk. I presume this is enough for downstream tools, but I may be wrong. Note that this is implemented as an extras option and is documented accordingly.
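For reference, here is a minimal sketch of the shuffle idea (not the actual write_reads.R code; the reads below are placeholders standing in for one 1M-read chunk):

library(Biostrings)
reads <- DNAStringSet(c(r1 = "ACGT", r2 = "TTGA", r3 = "GGCC"))  # placeholder chunk
set.seed(1)
shuffle_idx <- sample(length(reads))  # random permutation of read indices
reads <- reads[shuffle_idx]           # reads are then written out in shuffled order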

Speed Improvements

The second goal is implemented with three major changes, built on the work of @ttdtrang in #38:

  1. In b8ae1ebf5c2c0864fcaff4e93c06b0fe5b155690, I used the parallel package to parallelize sgseq.R (a minimal sketch of the idea appears right after this list). I switched away from doParallel because I couldn't quite figure out how to import that package in a way that passes checks. I added documentation explaining the option to the user, and set the default to 1 core to preserve the current behavior. I agree with @ttdtrang's decision to make it an explicit option in simulate_experiment.R (though I also included it as an option for the other simulate_experiment* functions); it could also easily be implemented as an extras option. Note that I have not looked into memory profiling for this option, so that may be something to check before fully committing to it.

  2. Minor speed enhancements came from switching from paste0 to sprintf. I had independently come up with this idea, but @ttdtrang not only beat me to the punch but also implemented it more cleanly than I did. Note that I removed the .Internal calls because they triggered complaints during R CMD check.
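For illustration only, here is a minimal sketch of the kind of per-sample parallelization described in item 1 (the worker function, object names, and sample count are placeholders, not the actual sgseq.R code):

library(parallel)

# hypothetical per-sample worker: generate fragments and write reads for one sample
simulate_one_sample <- function(sample_idx) {
  Sys.sleep(0.1)  # stand-in for fragment generation + FASTA writing
  paste0("sample_", sample_idx, " done")
}

cores <- 1  # default of 1 preserves the current serial behavior
if (cores > 1 && .Platform$OS.type != "windows") {
  res <- mclapply(seq_len(10), simulate_one_sample, mc.cores = cores)
} else {
  res <- lapply(seq_len(10), simulate_one_sample)
}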

To get a sense of the paste0-versus-sprintf benchmarks, you can run the following:

library(microbenchmark)
readlen <- 100
tObj <- rep(0L, 1000000)
# mimic Ensembl transcript names
names(tObj) <- paste0('ENST00000', sprintf('%06d', 1:1000000))
set.seed(1)
# quick generation of fake transcript lengths and fragment lengths
L <- sample(50:5000, length(tObj), replace = TRUE)
fraglens <- sample(75:1970, length(tObj), replace = TRUE)
s <- which(fraglens < L)
n = length(s)
# get start positions using cDNA fragmentation bias
data('cdnaf', package = "polyester")
starts_pct = sample(cdnaf$pospct, size = n, prob = cdnaf$prob, 
                    replace = TRUE)
starts_pct[starts_pct == 1] = 0.999
start_pos = floor(starts_pct * (L[s] - fraglens[s] + 2))
start_pos[start_pos == 0] = 1

# create dummy functions to allow pretty labels for
# 'expr' column in benchmark
old_func <- function() {
  paste0(names(tObj[s]), ";mate1:", start_pos, 
         "-", start_pos + readlen - 1, ";mate2:", start_pos + 
         fraglens[s] - readlen, "-", start_pos + fraglens[s] - 1)
}
new_func <- function() {
  sprintf("%s;mate1:%d-%d;mate2:%d-%d", names(tObj)[s], 
        start_pos, start_pos + readlen - 1, start_pos + fraglens[s] - 
            readlen + 1, start_pos + fraglens[s] - 1)
}
m <- microbenchmark(old_func(), new_func(), times = 10) 

print(m, order = "median", digits = 3)
# Unit: seconds
#        expr  min   lq mean median   uq  max neval
#  new_func() 1.02 1.66 1.68   1.71 1.90 2.09    10
#  old_func() 4.09 4.18 4.29   4.23 4.32 4.74    10

A few seconds saved may not be much, but if there is a similar-scale improvement with any of the other paste0 -> sprintf changes, the savings add up over the course of a full iteration and across iterations. For example, in my case, I want to generate 30 million reads in ten samples over at least ten separate simulated experiments. 2.5 seconds saved per iteration × 30 iterations × 10 samples × 10 experiments = roughly two clock-hours saved over the full simulation.

  3. Major speed improvements were gained by reconfiguring the code in add_platform_error away from the nested ifelse statements. I couldn't get @ttdtrang's implementation to work for me (switch requires its EXPR argument to have length 1, and p here has length greater than 1). My solution is to do the following (see cc6fee55fb38ca4230dfcb2497161c856c9d4202; a minimal sketch of the idea follows this list):
    1. realize that the error model data.frame has a regular pattern, so convert the relevant section to a matrix, and recast it as a 3D array made of 5x5 error matrices for each position
    2. calculate all new bases in an lapply call with the error matrix for each position; then remove entries for bases with no representation in that position
    3. replace each set of bases with the new error bases using a for loop
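For illustration, here is a minimal sketch of the 3D-array idea (not the actual add_platform_error internals; the substitution probabilities below are uniform placeholders):

bases <- c("A", "C", "G", "T", "N")
readlen <- 100
# hypothetical per-position 5x5 substitution matrices:
# err_array[true_base, observed_base, position]
err_array <- array(1/5, dim = c(5, 5, readlen),
                   dimnames = list(bases, bases, NULL))
# for each position, draw an observed (possibly erroneous) base given the true base
new_bases <- lapply(seq_len(readlen), function(pos) {
  sample(bases, size = 1, prob = err_array["A", , pos])
})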

To run the benchmarks, I used the GENCODE v25 human transcript multi-FASTA file, loaded into transcripts using Biostrings::readDNAStringSet. I then simulated some true counts and took the first column, loaded as sample_counts (if you want these counts, I can send them). Using these, I generated reads for benchmarking with the following code:
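For context, the two objects assumed below, transcripts and sample_counts, were created along these lines (the file path and count simulation here are illustrative placeholders):

library(Biostrings)
# GENCODE v25 transcript sequences (path is illustrative)
transcripts <- readDNAStringSet("gencode.v25.transcripts.fa.gz")
set.seed(1)
# placeholder per-transcript counts standing in for the simulated true counts
sample_counts <- rpois(length(transcripts), lambda = 20)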

library(microbenchmark)
tObj <- rep(transcripts, times = sample_counts)
offset <- 1L
tSubset = tObj[offset:min(offset + 1000000L, length(tObj))]
data('empirical_density', package = "polyester")
data('cdnaf', package = "polyester")
tFrags = polyester:::generate_fragments(tSubset, 250, 25, 
   100, "empirical", NULL, "cdnaf")
reads <- polyester:::get_reads(tFrags, readlen = 100, TRUE)
data('ill100v5_mate1', package = "polyester")
data('ill100v5_mate2', package = "polyester")

add_platform_error1 <- polyester:::add_platform_error

# add_platform_error2 is the new function;
# polyester with the improved function was installed separately in the directory ~/test_R_library
library(polyester, lib.loc = "~/test_R_library")
add_platform_error2 <- polyester:::add_platform_error

# because there is some variability introduced with the rlogspline step,
# I removed all ambiguity by setting the seed before running the function
old_func <- function(range) { set.seed(1); add_platform_error1(reads[range], "illumina5", TRUE) }
new_func <- function(range) { set.seed(1); add_platform_error2(reads[range], "illumina5", TRUE) }

m1K <- microbenchmark(old_func(1:1000), new_func(1:1000), times = 10)
m10K <- microbenchmark(old_func(1:10000), new_func(1:10000), times = 10)
m100K <- microbenchmark(old_func(1:100000), new_func(1:100000), times = 10)
mAll <- microbenchmark(old_func(1:length(reads)), new_func(1:length(reads)), times = 5)

print(m1K, order = "median", digits = 3)
# Unit: milliseconds
#              expr min  lq mean median  uq max neval
#  new_func(1:1000) 146 149  207    193 239 334    10
#  old_func(1:1000) 343 365  419    440 449 517    10

print(m10K, order = "median", digits = 3)
# Unit: milliseconds
#               expr  min   lq mean median   uq  max neval
#  new_func(1:10000)  285  309  378    359  434  563    10
#  old_func(1:10000) 1624 1752 1958   1949 2019 2599    10

print(m100K, order = "median", digits = 3)
# Unit: seconds
#               expr   min    lq  mean median   uq   max neval
#  new_func(1:1e+05)  1.68  1.72  1.78   1.76  1.8  2.06    10
#  old_func(1:1e+05) 14.28 14.42 14.58  14.55 14.8 14.95    10

print(mAll, order = "median", digits = 3)
# Unit: seconds
#                       expr   min    lq  mean median    uq max neval
#  new_func(1:length(reads))  30.4  31.6  31.5   31.7  31.8  32     5
#  old_func(1:length(reads)) 315.5 318.1 319.5  319.0 321.4 324     5

Open to feedback! And major kudos to @ttdtrang for the hard work in implementing the first set of commits!

To-dos:

warrenmcg commented 5 years ago

Bump! Just wanted to see if you all were interested in merging these changes or not.

alyssafrazee commented 5 years ago

@warrenmcg Thank you so much for doing this, and apologies for the delay in responding! Really appreciate your hard work here.

I think the community would benefit quite a bit from these changes (the speedups are very impressive!!). However, ballgown is currently largely in "maintenance mode only" -- that means I might try to fix high-priority or very buggy things, but that only happens in my spare time (since I work full-time on other projects).

This seems like a great improvement and I'm interested in accepting it, but I'd like the change to somehow be verified, since the diff is pretty large. I unfortunately don't have bandwidth to write a test suite for polyester right now (I wish I had written an earlier one), but the deal I can offer is: if you can show that this code doesn't introduce any more bugs, I'm super happy to merge it in. I see the benchmarks above, which are AMAZING, and I really appreciate that work! If you could also show that the simulated reads produced with the same seed are the same between master and this branch, I would love to accept this PR.

warrenmcg commented 5 years ago

Hi @alyssafrazee,

Thanks for the feedback! I'm in the process of setting up a test script right now. While doing that, I realized that the current code, independent of any comparison with the old code, may not be reproducible across different settings, for two reasons:

  1. Comparing shuffled to unshuffled runs: any shuffling advances the RNG, so each subsequent batch of 1 million reads is generated from different random numbers than it would have been if the reads hadn't been shuffled.
  2. When using multiple cores, the same job run twice is unlikely to generate exactly the same reads, because of how parallel RNG is handled (by default, each child gets a new RNG stream seeded from its process ID and a time stamp).
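For reference, here is a minimal sketch of one way to make the parallel RNG reproducible (using a PSOCK cluster for illustration; the PR itself may use a different backend):

library(parallel)
cl <- makeCluster(2)
# give every worker its own reproducible L'Ecuyer-CMRG stream derived from one seed
clusterSetRNGStream(cl, iseed = 1)
res <- parLapply(cl, 1:4, function(i) rnorm(1))
stopCluster(cl)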

Here is what I propose to do:

Also, I think the default should be to set the seed (we can choose 1 or some other low number as the default seed), regardless of whether the user does so.

Let me know what you think!

warrenmcg commented 5 years ago

Hi @alyssafrazee,

I have completed the testing, and have verified that there are no substantial differences between the output of the old and new code. I've attached the code I used so you can verify for yourself (though I had to change the extension to 'txt' for GitHub to let me upload it).

Code: polyester_test_code.txt

A few things to note:

Let me know if you have any questions!

warrenmcg commented 5 years ago

One more follow-up: I'm not sure why, but I noticed that when switching from Mersenne-Twister to L'Ecuyer-CMRG for RNG, the new reads ended up not being exactly the same as the old reads. Of the ~2M reads, roughly 20-25% had at least one difference, with the vast majority of those having only one difference. However, the start positions are all the same, and the count matrix is all the same too.

I will chalk this up to how the RNG interacts with the change in how platform error is added, and personally I'd be OK trading slight differences in how errors are generated for the size of the speed-up these changes bring.

creativequotient commented 4 years ago

Being able to parallelize the writing of files would be immensely helpful. Any chance this can be merged into the main branch?