MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

multiply mapped sRNA in Counts.txt #127

Closed AlineMuyle closed 1 year ago

AlineMuyle commented 1 year ago

Dear Mike, First of all, thank you so much for this amazing tool. I was wondering how multiply mapped sRNAs are counted in the output Counts.txt file. If I understood correctly, each sRNA is counted only once even if it is multiply mapped? And the locus that gets the count is chosen according to the weights calculated as explained in Johnson et al. 2016, is that right? So the sRNA would have a higher chance of being counted at the locus with the most uniquely mapped reads, but there is also a lower probability that it ends up counted at another locus with fewer uniquely mapped reads, is that so?

Another unrelated question, how many MIRNA do you usually identify in a plant with your de novo search for MIRNA when analyzing a few sRNA libraries?

Thanks a lot

MikeAxtell commented 1 year ago

Making alignment decisions and producing the counts listed in Counts.txt (and Results.txt) are two separate phases.

Alignment decisions for multi-mapping reads are as described in Johnson et al. The default mode is fractional weighting. The decision on where to place each multi-mapped read is based on the weightings of all alignments in the local area. It's not just the uniquely mapped reads in the area that contribute to the weighting; multi-mappers are given fractional counts. Once placed, each read has a single alignment position.
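As a rough illustration of the placement idea described above, here is a minimal Python sketch. It is not ShortStack's code. It assumes that each uniquely mapped read adds 1 to a locus's local weight, that each multi-mapper with n possible positions adds 1/n, and that the read is then placed by a weighted random draw; the function names `local_weight` and `place_read` are hypothetical.

```python
import random

def local_weight(n_unique, multimapper_hit_counts):
    """Weight of one candidate locus (assumed scheme, for illustration only).

    n_unique: number of uniquely mapped reads near the locus.
    multimapper_hit_counts: for each multi-mapped read near the locus,
        the total number of positions that read can align to.
    """
    # Each unique read counts 1; each multi-mapper counts 1/n (fractional).
    return n_unique + sum(1.0 / n for n in multimapper_hit_counts)

def place_read(weights, rng):
    """Pick one candidate locus index with probability proportional to weight."""
    if sum(weights) == 0:
        # No local evidence at all: fall back to a uniform choice (assumption).
        return rng.randrange(len(weights))
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Locus A: 8 unique reads plus two multi-mappers that each have 2 possible hits.
# Locus B: no unique reads, the same two multi-mappers.
weights = [local_weight(8, [2, 2]), local_weight(0, [2, 2])]
probabilities = [w / sum(weights) for w in weights]

rng = random.Random(0)  # fixed seed so the example is repeatable
chosen = place_read(weights, rng)  # index of the single locus that gets the read
```

Under these assumptions the weights are 9.0 and 1.0, so the read lands at locus A about 90% of the time and at locus B about 10% of the time, but only ever at one of them.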

For counting, ShortStack just counts the single alignment position of each read. ShortStack will accept any BAM file, not just those produced with the ShortStack alignment wrapper. Thus, how multi-mapped reads are handled in read counting is up to the aligner used. ShortStack counts are just counts of the reads that are in the BAM file (excluding any alignments flagged as secondary, if present).
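The counting rule above can be sketched in a few lines of Python. This is only an illustration, not ShortStack's implementation: it reads SAM-formatted text rather than BAM, it assigns a read to a locus by its leftmost aligned position (a simplification), and the function name `count_reads` and the example records are made up. The flag values 0x4 (unmapped) and 0x100 (secondary alignment) come from the SAM specification.

```python
UNMAPPED = 0x4     # SAM flag bit: read is unmapped
SECONDARY = 0x100  # SAM flag bit: secondary alignment

def count_reads(sam_lines, chrom, start, end):
    """Count alignments in chrom:start-end (1-based, inclusive),
    skipping header lines, unmapped reads, and secondary alignments."""
    n = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (UNMAPPED | SECONDARY):
            continue
        # Simplification: use the leftmost aligned position (POS) only.
        if fields[2] == chrom and start <= int(fields[3]) <= end:
            n += 1
    return n

# Hypothetical records: r2 is a multi-mapper whose chr1 alignment is
# flagged secondary (256), so only its chr2 alignment is counted.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t255\t21M\t*\t0\t0\t*\t*",
    "r2\t256\tchr1\t105\t0\t21M\t*\t0\t0\t*\t*",
    "r2\t0\tchr2\t500\t0\t21M\t*\t0\t0\t*\t*",
    "r3\t16\tchr1\t110\t255\t21M\t*\t0\t0\t*\t*",
]
chr1_count = count_reads(sam, "chr1", 90, 200)
chr2_count = count_reads(sam, "chr2", 400, 600)
```

In this toy input the chr1 locus gets 2 reads (r1 and r3) and the chr2 locus gets 1 (r2), so each read is counted once no matter how many places it could have aligned.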

For your second question, it really depends on the specifics of the data. 60-100 is a good ballpark number.

AlineMuyle commented 1 year ago

Thanks a lot!