MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

Number of mapped reads incorrect #64

Closed rmonti closed 6 years ago

rmonti commented 7 years ago

Hi Mike,

I've noticed that the number of mapped reads is incorrectly determined during one of my recent runs. These are the relevant sections of the log-file:

Wed Sep 13 09:28:05 PDT 2017
Starting alignment of BWTBX/SEQ/BWTBX.fq.gz with bowtie command:
    gzip -d -c BWTBX/SEQ/BWTBX.fq.gz | bowtie -q -v 1 -p 16 -S -a -m 500 --best --strata --sam-RG ID:BWTBX BWTBX/SEQ/Pismi1_AssemblyScaffolds -
    Completed. Results are in temporary file BWTBX/ShortStack_out/BWTBX_readsorted.sam.gz pending final processing
    Unique mappers: 1728654 / 23455757 (7.4 %)
    Multi mappers: 1534287 / 23455757 (6.5 %)
    Non mappers: 20192816 / 23455757 (86.1 %)

Wed Sep 13 09:36:35 PDT 2017
Processing and sorting alignments
    Working on BWTBX/ShortStack_out/BWTBX_readsorted.sam.gz
Summary of primary alignments:
    XY:Z:N -- Unmapped because no valid alignments: 20192816 / 23455757 (86.1 %)
    XY:Z:M -- Unmapped because alignment number exceeded option bowtie_m 500: 0 / 23455757 (0.0 %)
    XY:Z:O -- Unmapped because alignment number exceeded option ranmax 3 and no guidance was possible: 11167 / 23455757 (0.0 %)
    XY:Z:U -- Uniquely mapped: 1728654 / 23455757 (7.4 %)
    XY:Z:R -- Multi-mapped with primary alignment chosen randomly: 99203 / 23455757 (0.4 %)
    XY:Z:P -- Multi-mapped with primary alignment chosen based on u: 1423917 / 23455757 (6.1 %)
Alignment completed. Final bam file is BWTBX/ShortStack_out/BWTBX.bam

Wed Sep 13 09:44:16 PDT 2017
Performing de-novo cluster identification and analyses.

    At specified mincov of 0.5rpm with 23455757 mapped reads,
    mincov is 12 raw alignments
Completed at Wed Sep 13 09:48:40 PDT 2017

why does it say 23455757 mapped reads, when in fact that is the total number of reads?

MikeAxtell commented 7 years ago

Hi. This looks like poor wording on my part. For the mincov setting, ShortStack calculates the rpm value based on the total number of input reads. Reads that don't get placed still count in the denominator of the reads per million calculation. In essence it's a fraction of the library. Using the word 'mapped' there is misleading. Really it's the the number of input reads.

I'll make a note to update that reporting message in the next release. Thanks for the comment.

rmonti commented 7 years ago

Hi Mike,

If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.

Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.

It should be fairly simple to implement.

cheers, Remo

MikeAxtell commented 7 years ago

sure, I can add that switch. Stay tuned.

Whether or not to use rpm or rpmm to determine the depth of coverage at which a cluster is called is debatable. Using rpm has the advantage of always being a fraction of the total library size, which is why it's my preference. If using rpmm, then the same nominal setting of rpmm will give different sensitivities depending on the quality of the reference genome, and other things (like how many modified / non-templated nts there are in the small RNAs). For instance my group is doing a lot of work with small RNA data from samples that have multiple organisms.

Anyhow, thanks for the suggestion, and I will implement that one soon.

Mike

On Thu, Sep 14, 2017 at 1:36 PM, rmonti notifications@github.com wrote:

Hi Mike,

If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.

Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.

It should be fairly simple to implement.

cheers, Remo

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/64#issuecomment-329554932, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiWLfuoa9czHEBjA-NuID0wioPlueks5siWQHgaJpZM4PWcBN .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

rmonti commented 7 years ago

Hi Mike,

I haven't really considered those reasons for choosing rpm over rpmm, thanks for the insights on your part. I'm currently dealing with data that seems to have a very low fraction of reads mapped to the reference (20%), which I think is partially due to the presence of other organisms. If I then use rpm, the cutoff seems unnecessarily high to trigger the analysis. We usually like to predict more clusters than necessary (the assumption being that a lot of them are junk), rather than missing potential smRNA genes based on a stringent cutoff. Filtering low quality clusters based on certain parameters out is easy at later stages of the analysis. But that is a matter of preference and related to the scientific question, of course.

Cheers and thanks for maintaining ShortStack so well!

Remo

On Thu, Sep 14, 2017 at 11:12 AM, Mike Axtell notifications@github.com wrote:

sure, I can add that switch. Stay tuned.

Whether or not to use rpm or rpmm to determine the depth of coverage at which a cluster is called is debatable. Using rpm has the advantage of always being a fraction of the total library size, which is why it's my preference. If using rpmm, then the same nominal setting of rpmm will give different sensitivities depending on the quality of the reference genome, and other things (like how many modified / non-templated nts there are in the small RNAs). For instance my group is doing a lot of work with small RNA data from samples that have multiple organisms.

Anyhow, thanks for the suggestion, and I will implement that one soon.

Mike

On Thu, Sep 14, 2017 at 1:36 PM, rmonti notifications@github.com wrote:

Hi Mike,

If that is the case, could you also add the option to specify rpmm, i.e. reads per million mapped reads? For a lot of the analysis we do here at JGI that is the more sensible option, as the fraction of mapped reads can vary a lot based on the quality of the data and the reference genome.

Otherwise I would have to estimate the fraction of mapped reads first and then manually adjust the number of reads that trigger the cluster analysis for every sample, before running ShortStack.

It should be fairly simple to implement.

cheers, Remo

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/ 64#issuecomment-329554932, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiWLfuoa9czHEBjA- NuID0wioPlueks5siWQHgaJpZM4PWcBN .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/64#issuecomment-329564977, or mute the thread https://github.com/notifications/unsubscribe-auth/AWa6Cp-weM0QdQpuWrGcpFBGwAdJdG5Cks5siWyFgaJpZM4PWcBN .

MikeAxtell commented 6 years ago

Hi again. I just pushed commit 0dd0055 to deal with this request. Now the option --mincov will accept integers (for raw reads), rpm values (for reads per million), or rpmm (for reads per million mapped).

I haven't tagged it as a release yet; can you check it out and see if it is what you were looking for?

MikeAxtell commented 6 years ago

Release 3.8.5 adds this feature. Thanks again for the request.