Closed rmonti closed 6 years ago
Hi. This looks like poor wording on my part. For the mincov setting, ShortStack calculates the rpm value based on the total number of input reads. Reads that don't get placed still count in the denominator of the reads per million calculation. In essence it's a fraction of the library. Using the word 'mapped' there is misleading. Really it's the the number of input reads.
I'll make a note to update that reporting message in the next release. Thanks for the comment.
Hi Mike,
If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.
Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.
It should be fairly simple to implement.
cheers, Remo
sure, I can add that switch. Stay tuned.
Whether or not to use rpm or rpmm to determine the depth of coverage at which a cluster is called is debatable. Using rpm has the advantage of always being a fraction of the total library size, which is why it's my preference. If using rpmm, then the same nominal setting of rpmm will give different sensitivities depending on the quality of the reference genome, and other things (like how many modified / non-templated nts there are in the small RNAs). For instance my group is doing a lot of work with small RNA data from samples that have multiple organisms.
Anyhow, thanks for the suggestion, and I will implement that one soon.
Mike
On Thu, Sep 14, 2017 at 1:36 PM, rmonti notifications@github.com wrote:
Hi Mike,
If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.
Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.
It should be fairly simple to implement.
cheers, Remo
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/64#issuecomment-329554932, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiWLfuoa9czHEBjA-NuID0wioPlueks5siWQHgaJpZM4PWcBN .
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
Hi Mike,
I haven't really considered those reasons for choosing rpm over rpmm, thanks for the insights on your part. I'm currently dealing with data that seems to have a very low fraction of reads mapped to the reference (20%), which I think is partially due to the presence of other organisms. If I then use rpm, the cutoff seems unnecessarily high to trigger the analysis. We usually like to predict more clusters than necessary (the assumption being that a lot of them are junk), rather than missing potential smRNA genes based on a stringent cutoff. Filtering low quality clusters based on certain parameters out is easy at later stages of the analysis. But that is a matter of preference and related to the scientific question, of course.
Cheers and thanks for maintaining ShortStack so well!
Remo
On Thu, Sep 14, 2017 at 11:12 AM, Mike Axtell notifications@github.com wrote:
sure, I can add that switch. Stay tuned.
Whether or not to use rpm or rpmm to determine the depth of coverage at which a cluster is called is debatable. Using rpm has the advantage of always being a fraction of the total library size, which is why it's my preference. If using rpmm, then the same nominal setting of rpmm will give different sensitivities depending on the quality of the reference genome, and other things (like how many modified / non-templated nts there are in the small RNAs). For instance my group is doing a lot of work with small RNA data from samples that have multiple organisms.
Anyhow, thanks for the suggestion, and I will implement that one soon.
Mike
On Thu, Sep 14, 2017 at 1:36 PM, rmonti notifications@github.com wrote:
Hi Mike,
If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.
Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.
It should be fairly simple to implement.
cheers, Remo
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/ 64#issuecomment-329554932, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiWLfuoa9czHEBjA- NuID0wioPlueks5siWQHgaJpZM4PWcBN .
-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/64#issuecomment-329564977, or mute the thread https://github.com/notifications/unsubscribe-auth/AWa6Cp-weM0QdQpuWrGcpFBGwAdJdG5Cks5siWyFgaJpZM4PWcBN .
Hi again. I just pushed commit 0dd0055 to deal with this request. Now the option --mincov will accept integers (for raw reads), rpm values (for reads per million), or rpmm (for reads per million mapped).
I haven't tagged it as a release yet; can you check it out and see if it is what you were looking for?
Release 3.8.5 adds this feature. Thanks again for the request.
Hi Mike,
I've noticed that the number of mapped reads is incorrectly determined during one of my recent runs. These are the relevant sections of the log-file:
why does it say 23455757 mapped reads, when in fact that is the total number of reads?