MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License

Low number of miRNAs identified #159

Open rmr74370 opened 1 week ago

rmr74370 commented 1 week ago

Hello! I’m trying to use ShortStack for miRNA identification from sorghum root samples. To test the efficiency of our small RNA library protocol, we initially sequenced a small pool of 20 samples (which I will call Pool 1). When I ran ShortStack on them, it seemed to work fine and 37 miRNAs were identified. We later wanted to test more samples, so we sequenced Pool 2 (consisting of 80 samples). However, this pool yielded only 12 miRNAs according to ShortStack. I’m trying to figure out what might be causing this drastic difference between Pool 1 and Pool 2, especially since Pool 2 has more samples and I was expecting an equal or greater number of miRNAs to be identified.

Issue: Pool 2 has far fewer miRNAs identified than Pool 1 (12 miRNAs vs. 37) despite having more samples (80 samples vs. 20). Why?

- Pool 1: 37 miRNAs
- Pool 2: 12 miRNAs

Sequencing read depth?

- Pool 1&2, more than 5 million reads: 15 miRNAs
- Pool 1&2, more than 5 million reads, no outlier: 10 miRNAs
- Pool 1&2, more than 1 million reads, no outlier: 12 miRNAs

Bad sample interfering with algorithm? Sample size?

- Pool 2, samples 1-20: 36 miRNAs
- Pool 2, samples 21-40: 0 miRNAs
- Pool 2, samples 41-60: 18 miRNAs
- Pool 2, samples 61-80: 10 miRNAs

- Pool 2, samples 21-40, subset 1 (5 samples): 33 miRNAs
- Pool 2, samples 21-40, subset 2 (5 samples): 30 miRNAs
- Pool 2, samples 21-40, subset 3 (5 samples): 30 miRNAs
- Pool 2, samples 21-40, subset 4 (5 samples): 28 miRNAs

Additional Notes: I’ve been using version 3.8.5 because a labmate of mine used that version and I wanted to keep my results comparable to his, but I could switch to the most recent version if you think that would help. I’ve also been using all the defaults, though I have considered changing --mincov to something like 0.5 to increase sensitivity. All of the runs used the same Conda environment and the same script (only the input samples were changed).

Do you have any ideas on why Pool 2 doesn’t seem to be working normally? Any help would be greatly appreciated. Thanks!

MikeAxtell commented 1 week ago

Hi Rachel, thanks for your message. It's hard to tell really. I definitely urge you to upgrade to the latest ShortStack. It has much better all-around performance. 3.8.5 was mature code but was very conservative in calling loci as true microRNA loci.

Running on super-deep data may cause strange things. I developed ShortStack largely with smaller-scale experiments in mind, like 3-20 sRNA-seq libraries.

But your sampling indicates read depth is not the only variable. Here's one idea: some microRNA libraries are low-quality in that they contain a lot of degraded bits of RNA. These are mostly outside of the 21-24 nt size range. It could be that some of your libraries have large numbers of reads from degraded RNAs. These will affect ShortStack's calling: any cluster where fewer than 80% of all alignments come from reads in the 21-24 nt range (that is, more than 20% come from reads <21 or >24 nt) is automatically tagged as "DicerCall N" and cannot be annotated as a microRNA.
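To make the 80% rule concrete, here is a minimal sketch in Python. It is not ShortStack's code, only an illustration of the rule as described above; the function names `size_fraction` and `dicer_call` are hypothetical, and the inputs are assumed to be plain lists of read sequences (for example, the reads from one library or the reads aligned to one cluster).

```python
from collections import Counter


def size_fraction(reads, dicermin=21, dicermax=24):
    """Fraction of reads whose length lies within [dicermin, dicermax]."""
    lengths = Counter(len(r) for r in reads)
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    in_range = sum(n for length, n in lengths.items()
                   if dicermin <= length <= dicermax)
    return in_range / total


def dicer_call(reads, dicermin=21, dicermax=24, threshold=0.8):
    """Illustrative call: "N" if fewer than `threshold` of the reads are in
    the size range, otherwise the most common in-range read length."""
    if size_fraction(reads, dicermin, dicermax) < threshold:
        return "N"
    in_range = Counter(len(r) for r in reads
                       if dicermin <= len(r) <= dicermax)
    return str(in_range.most_common(1)[0][0])


# Toy data (made up for illustration): 90% of reads are 21 nt.
clean = ["A" * 21] * 9 + ["A" * 30] * 1
# Toy data: half of the reads are short degradation fragments.
degraded = ["A" * 21] * 5 + ["A" * 17] * 5

print(dicer_call(clean))     # in-range fraction is 0.9
print(dicer_call(degraded))  # in-range fraction is 0.5
```

Running `size_fraction` on the reads of each library separately is one quick way to spot libraries dominated by reads outside 21-24 nt before pooling them.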

Anyway my best advice is this:

rmr74370 commented 1 week ago

Thank you so much for the feedback! I'll definitely try out your suggestions.

MikeAxtell commented 1 week ago

In case you are upgrading, be advised that I am about to drop a new version, 4.1.0, within the next few days. The new one has many improvements over the current release, especially in speed. So maybe wait until 4.1.0 drops to upgrade.

rmr74370 commented 1 week ago

Sounds good, thank you!

MikeAxtell commented 5 days ago

Version 4.1.0 was just released. If using Bioconda, wait a day or two for their system to catch up to the new release.