Generating zero counts - Githubissues

alyssafrazee / polyester

Bioconductor package "polyester", devel version. RNA-seq read simulator.

89 stars 51 forks source link

Hello,

Thanks for writing this package.

I've been tinkering around with simulate_experiment(), and I noticed that if the mean is small, then you will inevitably sample zeroes from rnbinom(), but polyester replaces all zeroes with 1:

from NB.R:

    numreads[numreads == 0] = 1

What is the reasoning behind this? If you're simulating from a large transcriptome, you would expect many of these things to get zero counts, no?

I have a fork where I'll be changing the behavior along with some other things:

return a list with:
- data.frame of counts
- data.frame of size parameters
fix a bug when basemeans comes out to zero (it generates NaN in the rnbinom function)

I'm happy to share if any of these are features you think are helpful.

Thanks a bunch.

Thanks for the comment. I believe the replacing of all zero counts with 1's was a bugfix from an early version of the software; I don't think it's necessary with the way things are set up here. I'll consider fixing in a later version. The end result of the simulation won't be too much different if there's 1 read (vs. 0) coming from specific transcripts.

basemeans shouldn't come out to 0 ever: it's wrapped in a "ceiling" call, so would only be 0 if fold_changes were 0 or reads_per_transcript were 0. Setting fold_changes to 0 doesn't make sense, and reads_per_transcript shouldn't be 0 (since if you don't want reads from that transcript, it shouldn't be included in the input fasta file in the first place). I'll make this more explicit in the argument checking. I'm curious about how you got basemeans to come out to zero in another case?

Thanks for the input; feel free to make a PR with your fork!

alyssafrazee / polyester

Generating zero counts #15