Low number of miRNAs identified

rmr74370 commented 1 week ago

Hello! I’m trying to use ShortStack for miRNA identification from sorghum root samples. To test the efficiency of our small RNA library protocol, we initially sequenced a small pool of 20 samples (which I will call Pool 1). When I ran ShortStack on them, they seemed to work fine and 37 miRNAs were identified. We later wanted to test more samples, so we sequenced Pool 2 (consisting of 80 samples). However, this pool yielded only 12 miRNAs according to ShortStack. I’m trying to figure out what might be causing this drastic difference between Pool 1 and Pool 2, especially since Pool 2 has more samples, so I was expecting an equal or greater number of miRNAs to be identified.

Issue: Pool 2 has far fewer miRNAs identified than Pool 1 (12 miRNAs vs. 37) despite having more samples (80 samples vs. 20). Why?

Pool 1: 37 miRNAs
Pool 2: 12 miRNAs

Sequencing read depth?

Pool 1: 20 samples; Average read depth of ~6.7 million reads (~⅕ of the samples had less than 5 million reads)
Pool 2: 80 samples; Average read depth of ~6.6 million reads (a little more than a third of the samples had less than 5 million reads); one outlier with 75 million reads (average without outlier is ~5.8 million reads).
To test whether sequencing read depth explained why Pool 2 didn’t have as many miRNAs, I reran ShortStack after filtering out the samples with lower read counts. I also decided to run Pool 1 and 2 together, because theoretically the miRNAs identified in Pool 1 should show up in the results even if they aren’t found in Pool 2. Results:

Pool 1&2, more than 5 million reads: 15 miRNAs
Pool 1&2, more than 5 million reads, no outlier: 10 miRNAs
Pool 1&2, more than 1 million reads, no outlier: 12 miRNAs

Unfortunately, despite running Pool 1 and 2 together and filtering out the lower-depth samples, the number of miRNAs identified was still very low. Why?
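For reference, the depth filtering described above can be sketched in a few lines of Python. This is only an illustration: the sample names, the read counts, and the 50 million cutoff used to drop the outlier are invented placeholders, not the real Pool 1 or Pool 2 values. The last lines redo the Pool 2 average without the outlier from the rounded figures quoted above, which gives about 5.7 million, in line with the ~5.8 million reported.

```python
from statistics import mean


def filter_by_depth(read_counts, min_reads, max_reads=None):
    """Keep samples with more than min_reads reads and, optionally, at most max_reads."""
    return {
        name: n
        for name, n in read_counts.items()
        if n > min_reads and (max_reads is None or n <= max_reads)
    }


# Hypothetical per-sample read counts (placeholders, NOT the real data).
read_counts = {
    "s01": 7_200_000,
    "s02": 4_100_000,
    "s03": 75_000_000,  # stands in for the single very deep outlier
    "s04": 6_300_000,
    "s05": 900_000,
}

# "More than 5 million reads", with and without the outlier.
kept = filter_by_depth(read_counts, min_reads=5_000_000)
kept_no_outlier = filter_by_depth(
    read_counts, min_reads=5_000_000, max_reads=50_000_000  # cutoff is an assumption
)

print(sorted(kept), mean(kept.values()))
print(sorted(kept_no_outlier), mean(kept_no_outlier.values()))

# Pool 2 check from the rounded figures: 80 samples averaging ~6.6 million reads,
# one of them with ~75 million. Values are in millions of reads.
pool2_mean_without_outlier = (80 * 6.6 - 75) / 79
print(round(pool2_mean_without_outlier, 2))
```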

Bad sample interfering with algorithm? Sample size?

I then wondered whether the difference in sample size had any impact on the algorithm (since Pool 1 had 20 samples and Pool 2 had 80). I also wondered whether one bad sample, perhaps formatted wrong or with some other issue, was messing up the ShortStack run somehow.
So I divided Pool 2 into sets of 20 and ran them separately. The results are below:

Pool 2, samples 1-20: 36 miRNAs
Pool 2, samples 21-40: 0 miRNAs
Pool 2, samples 41-60: 18 miRNAs
Pool 2, samples 61-80: 10 miRNAs

I found these results interesting since the first set (1-20) had 36 miRNAs identified, which is comparable to pool 1.
The 3rd (41-60) and 4th (61-80) sets didn’t surprise me too much, since I sorted the samples by decreasing total miRNA counts according to the results from the initial Pool 2 run. So I would expect the later sets to have fewer miRNAs.
But set 2 (21-40) having 0 miRNAs identified is a little strange. To see if there might be a problem sample mixed in somewhere, I further subdivided it into four subsets of 5 samples each. The results are below:

Pool 2, samples 21-40, subset 1 (5 samples): 33 miRNAs
Pool 2, samples 21-40, subset 2 (5 samples): 30 miRNAs
Pool 2, samples 21-40, subset 3 (5 samples): 30 miRNAs
Pool 2, samples 21-40, subset 4 (5 samples): 28 miRNAs

These results are a little confusing to me since they all seem fine. So why did running samples 21-40 together cause issues, but running them in sets of 5 was fine?

Additional Notes: I’ve been using version 3.8.5 since a labmate of mine used that version and I wanted to keep my results comparable to his. But I could switch to the most recent version if you think that would help. I’ve also been using all the defaults, though I have considered changing the --mincov to be something like 0.5 to increase sensitivity. I’ve also been using a Conda environment as well as the same script (just modifying which input samples) for all of the runs.

Do you have any ideas on why Pool 2 doesn’t seem to be working normally? Any help would be greatly appreciated. Thanks!

MikeAxtell commented 1 week ago

Hi Rachel, thanks for your message. It's hard to tell really. I definitely urge you to upgrade to the latest ShortStack. It has much better all around performance. 3.8.5 was mature code but was very conservative in calling loci true microRNA loci.

Running on super-deep data may cause strange things. I developed ShortStack largely with smaller-scale experiments in mind, like 3-20 sRNA-seq libraries.

But your sampling indicates read depth is not the only variable. Here's one idea: some microRNA libraries are low-quality in that they contain a lot of degraded bits of RNA. These are mostly outside of the 21-24 nt size range. It could be that some of your libraries have large numbers of reads from degraded RNAs. These will affect ShortStack's calling: any cluster in which fewer than 80% of all alignments come from 21-24 nt reads (that is, more than 20% come from reads <21 or >24 nt) is automatically tagged as "DicerCall N", and cannot be annotated as a microRNA.
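To make the 80% rule concrete, here is a minimal Python sketch of the threshold as described above. It is not ShortStack's actual code: the function name, the toy read-length lists, and the choice to report the most common in-range length as the call are all assumptions made for illustration.

```python
from collections import Counter

DICER_MIN, DICER_MAX = 21, 24  # size range quoted in the reply above
MIN_FRACTION = 0.8             # the 80% threshold quoted in the reply above


def dicer_call(read_lengths):
    """Return "N" if fewer than 80% of aligned reads are 21-24 nt.

    Otherwise return the most common in-range length as a string
    (a simplification assumed for this sketch).
    """
    counts = Counter(read_lengths)
    total = sum(counts.values())
    if total == 0:
        return "N"
    in_range = {size: counts[size] for size in range(DICER_MIN, DICER_MAX + 1)}
    if sum(in_range.values()) / total < MIN_FRACTION:
        return "N"
    return str(max(in_range, key=in_range.get))


# Toy alignments at one locus (hypothetical lengths, not real data).
clean = [21] * 90 + [30] * 10     # 90% of reads in the 21-24 nt range
degraded = [21] * 40 + [18] * 60  # only 40% of reads in range

print(dicer_call(clean))             # in range, so a size is called
print(dicer_call(degraded))          # below the threshold, so "N"
print(dicer_call(clean + degraded))  # pooled: 130 of 200 reads in range (65%)
```

The last call illustrates the concern: under this simplified rule, a locus that passes in a clean library can fall below the threshold once reads from a degraded library are merged in. Whether that is what happened in Pool 2 would need to be checked against the actual read-length distributions.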

Anyway my best advice is this:

Upgrade to the latest ShortStack. (it's easy, it's on bioconda).
Use a smaller subset of your data to run a full ShortStack analysis to get clusters and annotations.
Use ShortStack's count mode to quantify sRNA expression in these clusters. The counts can then be used for downstream work such as differential expression analysis.
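The two-step approach above could be scripted roughly as follows. This sketch only builds and prints the command lines; it does not run anything. It assumes the version 4 option names `--genomefile`, `--readfile`, `--outdir`, and `--locifile`, and it assumes that "count mode" corresponds to supplying the loci from the first run via `--locifile`. All file and directory names are hypothetical, including the loci file path. Depending on the version, the miRNA search may also need options such as `--known_miRNAs` or `--dn_mirna`. Please check `ShortStack --help` for your installed version before relying on any of this.

```python
import shlex


def annotate_command(genome, subset_reads, outdir, extra_args=()):
    """Step 1: full run on a small subset of libraries to define clusters."""
    return ["ShortStack", "--genomefile", genome,
            "--readfile", *subset_reads,
            "--outdir", outdir, *extra_args]


def count_command(genome, all_reads, locifile, outdir):
    """Step 2: quantify all libraries at the loci found in step 1."""
    return ["ShortStack", "--genomefile", genome,
            "--readfile", *all_reads,
            "--locifile", locifile,
            "--outdir", outdir]


# Hypothetical file names, for illustration only.
genome = "sorghum_genome.fa"
subset = [f"pool1_sample{i:02d}.fastq.gz" for i in range(1, 11)]
everything = subset + [f"pool2_sample{i:02d}.fastq.gz" for i in range(1, 81)]

step1 = annotate_command(genome, subset, "annotate_out")
# Assumed location of the loci file written by step 1.
step2 = count_command(genome, everything, "annotate_out/Results.gff3", "count_out")

print(shlex.join(step1))
print(shlex.join(step2))
```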

rmr74370 commented 1 week ago

Thank you so much for the feedback! I'll definitely try out your suggestions.

MikeAxtell commented 1 week ago

If you are upgrading, be advised that I am about to drop a new version, 4.1.0, within the next few days. The new one has many improvements over the current release, especially in speed. So maybe wait until 4.1.0 drops to upgrade.

rmr74370 commented 1 week ago

Sounds good, thank you!

MikeAxtell commented 5 days ago

Version 4.1.0 was just released. If using Bioconda, wait a day or two for their system to catch up to the new release.

MikeAxtell / ShortStack

Low number of miRNAs identified #159