MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html
40 stars 22 forks source link

Comparing modelGeneVar with old function trendVar #55

Closed funpys closed 4 years ago

funpys commented 4 years ago

Hi. Thanks to your package, I've been normalized sc-RNA seq data and found HVGs for years. Now, I'm wondering the difference between updated function 'modelGeneVar' and old function trendVar. The results between two functions are quiet different. Also, I cannot find fitting method like loess and splind in modelGeneVar. Is there any way to see same results with trendVar from modelGeneVar? And what is the major difference between two functions?

LTLA commented 4 years ago

The results between two functions are quiet different.

Yes, because modelGeneVar() (or specifically, fitTrendVar()) represents a major improvement over trendVar() in terms of the flexibility of the trend fit and the robustness of the function. See below.

Also, I cannot find fitting method like loess and splind in modelGeneVar.

Yes, because the spline is no longer necessary. In fact, loess isn't necessary either; all fitting is done using lowess (note the lack of a "w" in the latter).

Is there any way to see same results with trendVar from modelGeneVar?

No, I'm afraid that under the hood they are too different to reproduce the latter exactly from the former. I don't think that the trend fits would be all that different, but I would say that fitTrendVar() is better.

Note the deprecation warning on trendVar; in the next release, it will be removed from the package. This warning has been present since September, which I consider to be fairly good advance notice. If you are performing a new analysis, I would suggest using modelGeneVar() in the future. If you need reproducibility for historical analyses, the only real solution is to use containers like Docker.

And what is the major difference between two functions?

fitTrendVar() uses a weighting scheme to balance the contribution of genes at each abundance. This ensures that the calculations are not dominated by the majority of low-abundance genes, allowing the trend to fit properly to the fewer high-abundance genes to the right of the plot.

fitTrendVar() also uses a more robust approach for performing the initial parametric fit, which in turn improves the performance of the subsequent non-parametric lowess fit.

You don't mention what differences are actually of concern to you. I suspect that the biggest difference is actually not the trend fit, but rather in how the p-values are calculated between modelGeneVar() and decomposeVar(). The latter uses a chi-squared test, the validity of which was a pleasant fiction. The former uses a z-test with the standard deviation estimated from the scatter around the trend, which is more conservative and somewhat more statistically defensible.

funpys commented 4 years ago

Wow, concise and clear. Thank you so much!

LTLA commented 4 years ago

I will also add that if you don't have a lot of high-abundance genes, fitTrendVar()'s trend gets a bit wonky at the right edge (understandably, because there's no information there). In such cases, modelGeneVarByPoisson() is quite useful for getting a stable curve at high abundances.