alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.
http://biorxiv.org/content/early/2014/12/12/006015
89 stars 51 forks source link

Simulate 1 million reads at a time to avoid running out of memory. #29

Closed roryk closed 8 years ago

roryk commented 8 years ago

Since simulating is capped at 2^31 nucleotides, this simulates the reads in batches of 1 million reads and appends the results to the output file. Sorry for the whitespace fixes, editor removed the trailing whitespace from a bunch of lines.

roryk commented 8 years ago

I also pushed a fix to the Travis-CI configuration to use the native R support.

roryk commented 8 years ago

Refactored it a little bit to keep the same function signature for write_reads.

roryk commented 8 years ago

Pinging about this-- no love?

jtleek commented 8 years ago

Just having a busy week. Can I look it over Monday and let you know?

On Fri, Nov 13, 2015 at 9:35 AM Rory Kirchner notifications@github.com wrote:

Pinging about this-- no love?

— Reply to this email directly or view it on GitHub https://github.com/alyssafrazee/polyester/pull/29#issuecomment-156449187 .

roryk commented 8 years ago

Thanks Jeff for responding to the ping, no worries.

lcolladotor commented 8 years ago

@roryk Regarding the Travis switch from R Travis v1 to v2, I wish that it was as simple as that. See https://github.com/kasperdanielhansen/bsseq/issues/5 for a more lengthly discussion. I have also switched to R Travis v2 on my repositories (see https://github.com/lcolladotor/derfinder/blob/master/.travis.yml) but have other Travis configuration scripts for using https://github.com/metacran/r-builder if R-devel becomes absolutely necessary for testing. Do note, that the configuration you specified for Travis v2 is slightly different from the BioC nightly build test. You need to add:

r_build_args: "--no-manual --no-resave-data"
r_check_args: "--no-build-vignettes --no-manual --timings"
env:
  global:
    - _R_CHECK_TIMINGS_="0"

Your commit https://github.com/roryk/polyester/commit/4919c296d2f6ec8cfa816d0aaf3add82dbb1c405 is quite impressive. I suggest adding some unit tests using testthat for the functions you modified. For now, we only know that they didn't break the examples but none of them use your new argument offset. Or at least add a short example that uses a value for offset that is not 1 so it gets regularly tested in the Bioc nightly builds. It seems that it might also be worth adding a line like this:

stopifnot(is.integer(offset) & offset >= 1)

You would then need to change the default value of offset to 1L and explain that it has to be an integer >= 1 in the docs.

That same commit looks very long but I noticed that you mostly changed the formatting of some functions. Specially, by removing some trailing white space. That is true for R/sgseq.R, R/simulate_experiment.R, and partially for R/write_reads.R. Then in R/write_reads.R you added offset which is then passed as a TRUE/FALSE to writeXStringSet(append). Correct? Or did I miss other changes?

lcolladotor commented 8 years ago

Sorry, it looks like you also changed R/sgseq.R quite a bit (beyond formatting).

roryk commented 8 years ago

Hi @lcolladotor,

Thanks for responding-- and thanks for the information about the update from v1 to v2, that's super helpful. I'm not a big R package guy so I had no clue. I can cherry-pick just the relevant changes for 4919c29 into a new branch if you'd rather stay at v1.

Sorry about the whitespace edits, I have my editor configured to clean that up. I can drop those changes as well if it is an issue.

You're right the main changes are pretty simple:

  1. in sgseq, simulate 1 million reads at a time in batches
  2. in write_reads take an offset parameter to ensure the simulated reads all have a unique id
  3. in write_reads append to the output file rather than overwrite it

I can totally add unit tests in. Let me know what you'd prefer in terms of reopening a pull request and I'll d it. Thanks!

lcolladotor commented 8 years ago

@roryk I submitted a pull request to your proposed changes. See https://github.com/roryk/polyester/pull/1

Basically, I completed your change to R Travis v2 (with a note that metacran/r-builder might be necessary in the future). I also wrote an example on write_reads() that uses the offset argument and shows that you get the same results. It will work as a simple unit test for now. In the future @jtleek might be interested in implementing unit tests for all of polyester, but we'll leave that for another issue.

I looked at your changes in sqseq() and they make sense. I realize that adding an example that tests the usage of offset inside sgseq() doesn't seem very practical since you'll have to actually write more than 1 million of reads. So for now, I don't think that it makes sense to test it nightly on the Bioc builds.

I also updated the NEWS file to reflect the changes.

As for formatting, don't worry about it. I think that it just made your pull request seem very scary at first.


So, if it's all good, you'll accept my PR and then @jtleek will accept yours. He will then have to push the changes to the Bioconductor svn repository on the devel branch. We could also port the changes to the release branch if there is interest in doing so.

jtleek commented 8 years ago

This all sounds good but I think @roryk should add himself to the author list for the function and to the middle of the author list for the package as well. I already confirmed this was ok with Alyssa.

roryk commented 8 years ago

Thanks @lcolladotor for the cleanup of the pull request. Thanks @jtleek and @alyssafrazee, that is very generous to get added to the author list. I'll squash these commits into something more atomic.

roryk commented 8 years ago

The commit history is not too messy actually-- I can squash it if you'd like, or you can do it. Either way. Thanks for merging in the fix!

jtleek commented 8 years ago

Wait so am I waiting for someone to merge something or should I just accept this pull request now? @lcolladotor ?

lcolladotor commented 8 years ago

@jtleek Rory Kirchner already merged in my pull request to this branch. So you can go ahead and accept his pull request. Just note that you might also want to edit the DESCRIPTION file author's list (he only modified the author's list on the pkg help page) so that his contribution will be more noticeable.

You'll then have to push the changes to Bioc-devel svn. If you push them to the release branch, just change the version number accordingly.

roryk commented 8 years ago

Updated DESCRIPTION. Let me know if you'd like the commit history cleaned up or not and I'm happy to do that and rebase.

jtleek commented 8 years ago

Thanks so much @roryk and @lcolladotor

roryk commented 8 years ago

Thanks everyone!

lcolladotor commented 8 years ago

=)