MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

No phasiRNA annotation on ShortStack 4.0.0? #117

Closed sebel76 closed 1 year ago

sebel76 commented 1 year ago

Dear Michael (again),

I was curious about what the new version of ShortStack returns as miRNA annotation, so I tried it! I have 63 sRNA libraries from seven species. Except for Brachypodium, I am mapping reads to large Pooideae genomes (>5 Gb), and one genome has huge chromosomes (>1 Gb). Even so, I enjoy the improved sensitivity, as well as having known miRNAs annotated directly, which removes downstream annotation steps; it is great!

Using Conda, I installed ShortStack v4.0.0 on macOS 13.2.1 with a 3.6 GHz 10-core Intel Core i9 processor and 128 GB of 2667 MHz DDR4 memory. I ran ShortStack using --thread 20. Everything ran very well, but I did not see a huge speedup compared to ShortStack 3.8.5; maybe this is related to my machine...?

One feature removed in ShortStack v4.0.0 is the detection and annotation of PHAS loci, which I really enjoyed in v3.8.5. So I map the reads with ShortStack v4.0.0 using the --align_only option. Then I annotate miRNAs with ShortStack v4.0.0 and phasiRNAs with v3.8.5. It would be convenient to get both annotations from a single program.

Would you consider adding it to ShortStack v4.0.0? Otherwise, which tool do you think performs best for annotating phasiRNAs?

Best, Sébastien

MikeAxtell commented 1 year ago

Thanks Sébastien,

I did remove phasing scoring in this version. I frankly have very low confidence in our older method's ability to accurately score these genome-wide. We published a paper hitting on this a few years ago ... Polydore et al. https://doi.org/10.1002/pld3.101

I suggest looking into PhaseTank (https://phasetank.sourceforge.net), or into some of the other methods/tools we described in Polydore et al. You can filter for ShortStack loci that are mainly 21/22 nt to capture many phasiRNAs. It's the 24 nt phasiRNAs that are the hardest.
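As a rough illustration of the filtering idea, here is a minimal Python sketch. It assumes the ShortStack results table is tab-delimited with a header row and has a `DicerCall` column holding the predominant small RNA size for each locus; check the column names in your own output, since they are an assumption here. The function name `filter_loci_by_dicercall` and the sample rows are made up for the example.

```python
import csv
import io

def filter_loci_by_dicercall(handle, sizes=("21", "22")):
    """Return rows whose DicerCall value is one of the requested sizes.

    Assumes a tab-delimited table with a header line that includes a
    'DicerCall' column (an assumption about the results file layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["DicerCall"] in sizes]

# Made-up rows standing in for a real results file.
SAMPLE = (
    "Locus\tName\tDicerCall\n"
    "chr1:100-300\tCluster_1\t21\n"
    "chr1:900-1400\tCluster_2\t24\n"
    "chr2:50-260\tCluster_3\t22\n"
    "chr2:700-800\tCluster_4\tN\n"
)

if __name__ == "__main__":
    for row in filter_loci_by_dicercall(io.StringIO(SAMPLE)):
        print(row["Locus"], row["Name"], row["DicerCall"])
```

For a real run, pass an open file handle instead of the `io.StringIO` sample. The loci kept this way are candidates only; the filter says nothing about whether they are actually phased.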

In biological terms, usually people who are interested in phasing are really interested in secondary siRNA clusters. In a Venn diagram, secondary siRNA loci fully encompass all phasiRNAs, and then more. The issue that we see is that the very clean "phasing" signature is often obscured by secondary cleavages and noise. So finding just highly-phased regions results in missing many other regions that are, in fact, secondary siRNAs, but too noisy for phasing to be detected.
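To make the noise problem concrete, here is a toy Python sketch. It is not the scoring method from ShortStack 3.x or from Polydore et al.; it only computes the fraction of read start positions that fall in the most common 21 nt register. It also simplifies by treating all reads as coming from one strand, so the 2 nt offset between strands is ignored. The function name and the coordinates are invented for the example.

```python
from collections import Counter

def dominant_register_fraction(starts, phase_len=21):
    """Return (register, fraction) for the most common start-position register.

    'starts' are 5' positions of reads on a single strand (a simplification).
    """
    if not starts:
        raise ValueError("no read start positions given")
    counts = Counter(pos % phase_len for pos in starts)
    register, n = counts.most_common(1)[0]
    return register, n / len(starts)

# Ten perfectly phased reads, one every 21 nt.
clean = [100 + 21 * i for i in range(10)]

# The same reads plus ten out-of-register reads, standing in for
# secondary cleavage products and background noise.
noisy = clean + [103, 110, 117, 125, 131, 140, 152, 166, 171, 180]

if __name__ == "__main__":
    print(dominant_register_fraction(clean))
    print(dominant_register_fraction(noisy))
```

In the clean case every read sits in one register; in the noisy case only half do, even though the underlying phased reads are still all there. A threshold tuned for clean loci would drop the second locus.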

Anyway, that's the reason I decided to drop phasing scoring. Sorry to disappoint.