MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

Increasing --pad results in more loci found #61

Closed groverj3 closed 7 years ago

groverj3 commented 7 years ago

I've been using Shortstack to analyze small RNA reads from Brassica rapa to determine siRNA loci, their expression, etc.

I'm trying to determine the optimal values to use for the --mincov and --pad options for our data. Which I know have a drastic effect on the number of loci that are defined. For --mincov I decided to run shortstack over a range of values for this option. As it increased, the number of loci defined by shortstack decreased, as expected, and began to level off. I picked a value that appeared to be an inflection point to use for this option.

Varying the --pad option produced an unexpected result. According to the 2013 shortstack paper increasing --pad values should lead to fewer loci being defined, as one would expect based upon merging of ever further away clusters. However, increasing the --pad option over a range of values from 1nt to 1000nt (which includes the default of 75) resulted in more loci being defined. However, this was only true until I used values much greater than the default, replicating the result shown in figure 3B of the paper, where increasing --pad by orders of magnitude did decrease the number of loci defined, as expected.

I can't seem to wrap my head around how increasing --pad would actually drastically increase the number of loci (by a few thousand) for values relatively near (same order of magnitude) the default. I know that much about shortstack has changed since the 2013 paper and perhaps there is more complexity in its algorithms than I understand at first glance. Regardless, I was wondering if you or someone else over there could explain how this result could occur, and what your recommendation for picking an appropriate --pad value is.

loci_vs_pad

MikeAxtell commented 7 years ago

Hi Jeff. I took me a little time to refresh myself with my own code to answer this.

The answer is that the description of cluster finding in the 2013 RNA paper no longer applies exactly to the current version of ShortStack. If you look carefully at the README (or the code), you'll see that --mincov, as opposed to the definition in the paper / old versions, is now defined as

"-mincov [string] : Clusters of small RNAs must have at least this many alignments. Supply an integer between 1 and 50000. Can also be a normalized value in reads per million. When specifying mincov in reads per million, the mincov value must be a floating point number > 0 and < 500,000 followed by the string 'rpm'. Examples: '5' --> threshold is 5 raw reads. '3.2rpm' --> threshold is 3.2 reads per million mapped. Deafult: 0.5rpm."

So, if you start from extremely small sizes of --pad and increase, you are joining lots of adjacent proto-clusters that, on their own don't have enough reads to be kept, but when joined by increasing levels of--pad, do. Eventually, it looks like you hit an inflection point where further merging / increasing of pad is mostly merging proto-clusters that would have had enough reads to stand on their own, so you begin to have a net loss of clusters as you continue to merge.

About how to pick an appropriate level .. not sure. We don't have, to my knowledge, any solid ground truth / gold standards for what a 'cluster' is, particularly in plants for the most abundant type of endogenous small RNAs, the P4 siRNAs. So it seems like a personal preference to me.

Best, Mike

groverj3 commented 7 years ago

I took a look at the code, and with your explanation, this completely explains the result above. Yes, it'd be difficult to pick a "correct" value for --pad. It seems reasonable that proto-clusters which are ~75nt apart should be called different loci. I'll go ahead with the analysis using the default as well as a few other values to see what the effect on downstream steps are.

Thanks for the explanation, Jeff

MikeAxtell commented 7 years ago

Thanks for sharing the sweet curve! I bet the location of the inflection point depends strongly on the setting of --mincov too, as well as the genome, the depth of sequencing, and so forth, but probably the shape of the curve is about the same no matter what.

On Wed, Jul 19, 2017 at 6:23 PM, Jeffrey Grover notifications@github.com wrote:

I took a look at the code, and with your explanation, this completely explains the result above. Yes, it'd be difficult to pick a "correct" value for --pad. It seems reasonable that proto-clusters which are ~75nt apart should be called different loci. I'll go ahead with the analysis using the default as well as a few other values to see what the effect on downstream steps are.

Thanks for the explanation, Jeff

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/MikeAxtell/ShortStack/issues/61#issuecomment-316536563, or mute the thread https://github.com/notifications/unsubscribe-auth/AGiXiSntBaEWju7TRptuB7z5lIqcDvYQks5sPoHzgaJpZM4OdHEV .

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell