alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015

Shuffle Option, Add Error Info, & Speedups Part 2 #57

Closed warrenmcg closed 3 years ago

warrenmcg commented 6 years ago

Hi @alyssafrazee and @JMF47,

Motivated by my own need to streamline the generation of multiple simulated experiments, I set out to do two things:

  1. create a 'shuffle' option to meet the requirements for downstream quantification tools
  2. find ways to improve the speed

Shuffle Option

The first goal is implemented in write_reads.R (see b3fc42386f146dd1ee1236416910edfd01300cda). I simply use sample to shuffle the order of the reads within each 1M-read chunk. I presume this is enough for downstream tools, but I may be wrong. Note that this is implemented as an extras option and is documented accordingly.
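For reference, here is a minimal sketch of the shuffle idea (not the actual write_reads.R code; the reads below are placeholders standing in for one 1M-read chunk):

library(Biostrings)
reads <- DNAStringSet(c(r1 = "ACGT", r2 = "TTGA", r3 = "GGCC"))  # placeholder chunk
set.seed(1)
shuffle_idx <- sample(length(reads))  # random permutation of read indices
reads <- reads[shuffle_idx]           # reads are then written out in shuffled order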

Speed Improvements

The second goal is implemented with three major changes, built on the work of @ttdtrang in #38:

  1. In b8ae1ebf5c2c0864fcaff4e93c06b0fe5b155690, I used the parallel package to parallelize sgseq.R (a minimal sketch of the idea appears right after this list). I switched away from doParallel because I couldn't quite figure out how to import that package in a way that passes checks. I added documentation explaining the option to the user, and set the default to 1 core to preserve the current behavior. I agree with @ttdtrang's decision to make it an explicit option in simulate_experiment.R (though I also included it as an option for the other simulate_experiment* functions); it could also easily be implemented as an extras option. Note that I have not looked into memory profiling for this option, so that may be something to check before fully committing to it.

  2. Minor speed enhancements came from switching from paste0 to sprintf. I had independently come up with this idea, but @ttdtrang not only beat me to the punch but also implemented it more cleanly than I did. Note that I removed the .Internal calls because they triggered complaints during R CMD check.
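For illustration only, here is a minimal sketch of the kind of per-sample parallelization described in item 1 (the worker function, object names, and sample count are placeholders, not the actual sgseq.R code):

library(parallel)

# hypothetical per-sample worker: generate fragments and write reads for one sample
simulate_one_sample <- function(sample_idx) {
  Sys.sleep(0.1)  # stand-in for fragment generation + FASTA writing
  paste0("sample_", sample_idx, " done")
}

cores <- 1  # default of 1 preserves the current serial behavior
if (cores > 1 && .Platform$OS.type != "windows") {
  res <- mclapply(seq_len(10), simulate_one_sample, mc.cores = cores)
} else {
  res <- lapply(seq_len(10), simulate_one_sample)
}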

To get a sense of the paste0-versus-sprintf benchmarks, you can run the following:

library(microbenchmark)
readlen <- 100
tObj <- rep(0L, 1000000)
# mimic Ensembl transcript names
names(tObj) <- paste0('ENST00000', sprintf('%06d', 1:1000000))
set.seed(1)
# quick generation of fake transcript lengths and fragment lengths
L <- sample(50:5000, length(tObj), replace = TRUE)
fraglens <- sample(75:1970, length(tObj), replace = TRUE)
s <- which(fraglens < L)
n = length(s)
# get start positions using cDNA fragmentation bias
data('cdnaf', package = "polyester")
starts_pct = sample(cdnaf$pospct, size = n, prob = cdnaf$prob, 
                    replace = TRUE)
starts_pct[starts_pct == 1] = 0.999
start_pos = floor(starts_pct * (L[s] - fraglens[s] + 2))
start_pos[start_pos == 0] = 1

# create dummy functions to allow pretty labels for
# 'expr' column in benchmark
old_func <- function() {
  paste0(names(tObj[s]), ";mate1:", start_pos, 
         "-", start_pos + readlen - 1, ";mate2:", start_pos + 
         fraglens[s] - readlen, "-", start_pos + fraglens[s] - 1)
}
new_func <- function() {
  sprintf("%s;mate1:%d-%d;mate2:%d-%d", names(tObj)[s], 
        start_pos, start_pos + readlen - 1, start_pos + fraglens[s] - 
            readlen + 1, start_pos + fraglens[s] - 1)
}
m <- microbenchmark(old_func(), new_func(), times = 10) 

print(m, order = "median", digits = 3)
# Unit: seconds
#        expr  min   lq mean median   uq  max neval
#  new_func() 1.02 1.66 1.68   1.71 1.90 2.09    10
#  old_func() 4.09 4.18 4.29   4.23 4.32 4.74    10

A few seconds saved may not be much, but if there is a similar-scale improvement with any of the other paste0 -> sprintf changes, the savings add up over the course of a full iteration and across iterations. For example, in my case, I want to generate 30 million reads in ten samples over at least ten separate simulated experiments. 2.5 seconds saved per iteration × 30 iterations × 10 samples × 10 experiments = roughly two clock-hours saved over the full simulation.

  3. Major speed improvements were gained by reconfiguring the code in add_platform_error away from the nested ifelse statements. I couldn't get @ttdtrang's implementation to work for me (switch requires its EXPR argument to have length 1, and p here has length greater than 1). My solution is to do the following (see cc6fee55fb38ca4230dfcb2497161c856c9d4202; a minimal sketch of the idea follows this list):
    1. realize that the error model data.frame has a regular pattern, so convert the relevant section to a matrix, and recast it as a 3D array made of 5x5 error matrices for each position
    2. calculate all new bases in an lapply call with the error matrix for each position; then remove entries for bases with no representation in that position
    3. replace each set of bases with the new error bases using a for loop
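For illustration, here is a minimal sketch of the 3D-array idea (not the actual add_platform_error internals; the substitution probabilities below are uniform placeholders):

bases <- c("A", "C", "G", "T", "N")
readlen <- 100
# hypothetical per-position 5x5 substitution matrices:
# err_array[true_base, observed_base, position]
err_array <- array(1/5, dim = c(5, 5, readlen),
                   dimnames = list(bases, bases, NULL))
# for each position, draw an observed (possibly erroneous) base given the true base
new_bases <- lapply(seq_len(readlen), function(pos) {
  sample(bases, size = 1, prob = err_array["A", , pos])
})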

To run the benchmarks, I used the GENCODE v25 human transcript multi-FASTA file, loaded into transcripts using Biostrings::readDNAStringSet. I then simulated some true counts and took the first column, loaded as sample_counts (if you want these counts, I can send them). Using these, I generated reads for benchmarking with the following code:
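For context, the two objects assumed below, transcripts and sample_counts, were created along these lines (the file path and count simulation here are illustrative placeholders):

library(Biostrings)
# GENCODE v25 transcript sequences (path is illustrative)
transcripts <- readDNAStringSet("gencode.v25.transcripts.fa.gz")
set.seed(1)
# placeholder per-transcript counts standing in for the simulated true counts
sample_counts <- rpois(length(transcripts), lambda = 20)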

library(microbenchmark)
tObj <- rep(transcripts, times = sample_counts)
offset <- 1L
tSubset = tObj[offset:min(offset + 1000000L, length(tObj))]
data('empirical_density', package = "polyester")
data('cdnaf', package = "polyester")
tFrags = polyester:::generate_fragments(tSubset, 250, 25, 
   100, "empirical", NULL, "cdnaf")
reads <- polyester:::get_reads(tFrags, readlen = 100, TRUE)
data('ill100v5_mate1', package = "polyester")
data('ill100v5_mate2', package = "polyester")

add_platform_error1 <- polyester:::add_platform_error

# add_platform_error2 is the new function;
# polyester with the improved function was installed separately in the directory ~/test_R_library
library(polyester, lib.loc = "~/test_R_library")
add_platform_error2 <- polyester:::add_platform_error

# because there is some variability introduced with the rlogspline step,
# I removed all ambiguity by setting the seed before running the function
old_func <- function(range) { set.seed(1); add_platform_error1(reads[range], "illumina5", TRUE) }
new_func <- function(range) { set.seed(1); add_platform_error2(reads[range], "illumina5", TRUE) }

m1K <- microbenchmark(old_func(1:1000), new_func(1:1000), times = 10)
m10K <- microbenchmark(old_func(1:10000), new_func(1:10000), times = 10)
m100K <- microbenchmark(old_func(1:100000), new_func(1:100000), times = 10)
mAll <- microbenchmark(old_func(1:length(reads)), new_func(1:length(reads)), times = 5)

print(m1K, order = "median", digits = 3)
# Unit: milliseconds
#              expr min  lq mean median  uq max neval
#  new_func(1:1000) 146 149  207    193 239 334    10
#  old_func(1:1000) 343 365  419    440 449 517    10

print(m10K, order = "median", digits = 3)
# Unit: milliseconds
#               expr  min   lq mean median   uq  max neval
#  new_func(1:10000)  285  309  378    359  434  563    10
#  old_func(1:10000) 1624 1752 1958   1949 2019 2599    10

print(m100K, order = "median", digits = 3)
# Unit: seconds
#               expr   min    lq  mean median   uq   max neval
#  new_func(1:1e+05)  1.68  1.72  1.78   1.76  1.8  2.06    10
#  old_func(1:1e+05) 14.28 14.42 14.58  14.55 14.8 14.95    10

print(mAll, order = "median", digits = 3)
# Unit: seconds
#                       expr   min    lq  mean median    uq max neval
#  new_func(1:length(reads))  30.4  31.6  31.5   31.7  31.8  32     5
#  old_func(1:length(reads)) 315.5 318.1 319.5  319.0 321.4 324     5

Open to feedback! And major kudos to @ttdtrang for the hard work in implementing the first set of commits!

To-dos:

warrenmcg commented 5 years ago

Bump! Just wanted to see if you all were interested in merging these changes or not.

alyssafrazee commented 5 years ago

@warrenmcg Thank you so much for doing this, and apologies for the delay in responding! Really appreciate your hard work here.

I think the community would benefit quite a bit from these changes (the speedups are very impressive!!). However, ballgown is currently largely in "maintenance mode only" -- that means I might try to fix high-priority or very buggy things, but that only happens in my spare time (since I work full-time on other projects).

This seems like a great improvement and I'm interested in accepting it, but I'd like the change to somehow be verified, since the diff is pretty large. I unfortunately don't have bandwidth to write a test suite for polyester right now (I wish I had written an earlier one), but the deal I can offer is: if you can show that this code doesn't introduce any more bugs, I'm super happy to merge it in. I see the benchmarks above, which are AMAZING, and I really appreciate that work! If you could also show that the simulated reads produced with the same seed are the same between master and this branch, I would love to accept this PR.

warrenmcg commented 5 years ago

Hi @alyssafrazee,

Thanks for the feedback! I'm in the process of setting up a test script right now. While doing that, I realized that the current code, independent of any comparison with the old code, may not be reproducible across different settings, for two reasons:

  1. Comparing shuffled to unshuffled runs: any shuffling advances the RNG, so each subsequent batch of 1 million reads is generated from different random numbers than it would have been if the reads hadn't been shuffled.
  2. When using multiple cores, the same job run twice is unlikely to generate exactly the same reads, because of how parallel RNG is handled (by default, each child gets a new RNG stream seeded from its process ID and a time stamp).
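For reference, here is a minimal sketch of one way to make the parallel RNG reproducible (using a PSOCK cluster for illustration; the PR itself may use a different backend):

library(parallel)
cl <- makeCluster(2)
# give every worker its own reproducible L'Ecuyer-CMRG stream derived from one seed
clusterSetRNGStream(cl, iseed = 1)
res <- parLapply(cl, 1:4, function(i) rnorm(1))
stopCluster(cl)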

Here is what I propose to do:

Also, I think the default should be to set the seed (we can choose 1 or some other low number as the default seed), regardless of whether the user does so.

Let me know what you think!

warrenmcg commented 5 years ago

Hi @alyssafrazee,

I have completed the testing, and have verified that there are no substantial differences between the output of the old and new code. I've attached the code I used so you can verify for yourself (though I had to change the extension to 'txt' for GitHub to let me upload it).

Code: polyester_test_code.txt

A few things to note:

Let me know if you have any questions!

warrenmcg commented 5 years ago

One more follow-up: I'm not sure why, but I noticed that when switching from Mersenne-Twister to L'Ecuyer-CMRG for RNG, the new reads ended up not being exactly the same as the old reads. Of the ~2M reads, roughly 20-25% had at least one difference, with the vast majority of those having only one difference. However, the start positions are all the same, and the count matrix is all the same too.

I will chalk this up to how the RNG interacts with the change in how platform error is added, and personally I'd be OK trading slight differences in how errors are generated for the size of the speed-up these changes bring.

creativequotient commented 4 years ago

Being able to parallelize the writing of files would be immensely helpful. Any chance this can be merged into the main branch?