MarioniLab / scran

Clone of the Bioconductor repository for the scran package.
https://bioconductor.org/packages/devel/bioc/html/scran.html

Problem with argument "span" in trendVar function #35

Closed: FTD2018 closed this issue 5 years ago

FTD2018 commented 5 years ago

Hi Aaron, I want to use trendVar to identify HVGs in scRNA-seq data, but I'm a little confused by the "span" argument. As you know, different "span" values lead to substantial differences in the downstream analysis. Do you have any suggestions for how to decide the "span" value for a specific dataset? Looking forward to your reply!

LTLA commented 5 years ago

Using my psychic powers, I will determine what code you're running, and what kind of "substantial differences" you're talking about... Oh, wait. I don't have psychic powers.

Perhaps some examples would help? I don't expect all of this but some of it would be nice.

FTD2018 commented 5 years ago

Thanks for your reply. Like in the example from the Bioconductor workflow (https://bioconductor.org/packages/3.8/workflows/vignettes/simpleSingleCell/inst/doc/xtra-3-var.html), the span is 0.3. I used span values of seq(0.1, 1, length=10) to draw the plot:

[plot: fitted trends for spans from 0.1 to 1; blue is a lower span value, red is a higher span value]

So for this dataset, how can I decide which span value to use? Below is span 0.3 (red) versus 0.4 (blue):

[plot: fitted trends for span 0.3 (red) and 0.4 (blue)]

I just wonder whether there is an optimal value for the span, as discussed in https://stats.stackexchange.com/questions/2002/how-do-i-decide-what-span-to-use-in-loess-regression-in-r.

Thanks

LTLA commented 5 years ago

... are you fitting to the spike-ins? Or to the endogenous genes? Some code, please?

FTD2018 commented 5 years ago

The code is the same as in the example:

```r
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
wilson.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE61nnn/GSE61533/suppl/GSE61533_HTSEQ_count_results.xls.gz"))

library(R.utils)
wilson.name2 <- "GSE61533_HTSEQ_count_results.xls"
gunzip(wilson.fname, destname=wilson.name2, remove=FALSE, overwrite=TRUE)

library(gdata)
all.counts <- read.xls(wilson.name2, sheet=1, header=TRUE)
rownames(all.counts) <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])

library(SingleCellExperiment)
sce.hsc <- SingleCellExperiment(list(counts=all.counts))
dim(sce.hsc)

is.spike <- grepl("^ERCC", rownames(sce.hsc))
isSpike(sce.hsc, "ERCC") <- is.spike
summary(is.spike)

library(scater)
sce.hsc <- calculateQCMetrics(sce.hsc)

libsize.drop <- isOutlier(sce.hsc$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce.hsc$total_features_by_counts, nmads=3, type="lower", log=TRUE)
spike.drop <- isOutlier(sce.hsc$pct_counts_ERCC, nmads=3, type="higher")

sce.hsc <- sce.hsc[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop),
    BySpike=sum(spike.drop), Remaining=ncol(sce.hsc))

to.keep <- nexprs(sce.hsc, byrow=TRUE) > 0
sce.hsc <- sce.hsc[to.keep,]

library(scran)
sce.hsc <- computeSumFactors(sce.hsc)
summary(sizeFactors(sce.hsc))
sce.hsc <- computeSpikeFactors(sce.hsc, type="ERCC", general.use=FALSE)
summary(sizeFactors(sce.hsc, "ERCC"))
sce.hsc <- normalize(sce.hsc)

##################################################################
##################### HVG selection ##############################
##################################################################

# Fit an initial trend so that var.out exists for the plot below.
var.fit <- trendVar(sce.hsc, parametric=TRUE, loess.args=list(span=0.3))
var.out <- decomposeVar(sce.hsc, var.fit)

plot(var.out$mean, var.out$total, pch=16, cex=0.6, col="gray",
    xlab="Mean log-expression", ylab="Variance of log-expression")

library(RColorBrewer)
colors <- colorRampPalette(c("blue", "red"))(10)
a <- seq(0.1, 1, length=10)

for (i in 1:10) {
    var.fit <- trendVar(sce.hsc, parametric=TRUE, loess.args=list(span=a[i]))
    var.out <- decomposeVar(sce.hsc, var.fit)
    curve(var.fit$trend(x), col=colors[i], lwd=2, add=TRUE)
}
```

LTLA commented 5 years ago

In other words, you're using the spike-ins.

Variability in the loess trend is to be expected when you fit the trend to spike-ins. There are maybe 70-80 points? That's not a lot of information to estimate a mean-variance trend, which makes the curve sensitive to the choice of algorithm and parameters. This is especially true for the interval on the x-axis from 5 to 10, where the slope is steep but there are very few spike-ins to guide the location of the curve.

I wouldn't worry much about this. Just pick a middle-of-the-range span and make sure the curve is smooth and passes through or close to most of the spike-in points. The 0.3-0.4 curves look fine to me, and the differences in the trend are minimal. If you have a result that is being affected by such small changes, then you should reconsider the robustness of your conclusions.
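For instance, the check can be done by eye with something like the following (a minimal sketch, assuming the sce.hsc object from the code above; per the linked workflow, var.fit$mean and var.fit$var hold the spike-in means and variances used for the fit):

```r
# Fit one candidate trend; span=0.3 is the middle-of-the-range default here.
library(scran)
var.fit <- trendVar(sce.hsc, parametric=TRUE, loess.args=list(span=0.3))
var.out <- decomposeVar(sce.hsc, var.fit)

# How many spike-in points actually inform the fit, and where they sit
# on the abundance axis.
length(var.fit$mean)
summary(var.fit$mean)

# Overlay the trend on the spike-in points (red); the curve should be
# smooth and pass through or close to most of them.
plot(var.out$mean, var.out$total, pch=16, cex=0.6, col="gray",
    xlab="Mean log-expression", ylab="Variance of log-expression")
curve(var.fit$trend(x), col="dodgerblue", lwd=2, add=TRUE)
points(var.fit$mean, var.fit$var, col="red", pch=16)
```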

To answer your last question, cross-validation with so few points (and so unevenly distributed throughout the abundance range) is probably a waste of time. In theory, method="spline" could do cross-validation, but trendVar sets df=4 as a fail-safe. I guess I could turn this off to get the default CV behaviour.

LTLA commented 5 years ago

To follow up: it is actually possible to force the issue by setting spline.args=list(df=0). This is a nonsensical df setting that gets ignored by smooth.spline (with a whole bunch of warnings), forcing it to use cross-validation. Unfortunately, the default cross-validation scheme gives something nonsensical:

[plot "crazy": trend selected by default generalized cross-validation]

It gets a little better with spline.args=list(df=0, cv=TRUE) (to turn off generalized cross-validation):

[plot "better": trend selected by leave-one-out cross-validation]

But the best-looking curve (to me) is still chosen manually, with spline.args=list(df=5):

[plot "best": trend with manually chosen df=5]
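For reference, the three fits above correspond to calls along these lines (a sketch assuming the same sce.hsc object as before; with method="spline", trendVar passes spline.args on to smooth.spline):

```r
# df=0 is nonsensical and gets ignored by smooth.spline (with a whole bunch
# of warnings), forcing its default generalized cross-validation
# (the "crazy" curve above).
fit.gcv <- trendVar(sce.hsc, method="spline", spline.args=list(df=0))

# cv=TRUE switches from generalized to ordinary leave-one-out
# cross-validation (the "better" curve).
fit.cv <- trendVar(sce.hsc, method="spline", spline.args=list(df=0, cv=TRUE))

# A manually chosen degrees of freedom (the "best"-looking curve).
fit.manual <- trendVar(sce.hsc, method="spline", spline.args=list(df=5))
```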

FTD2018 commented 5 years ago

Thanks for your answer.

Before this, I was struggling to find the optimal "span" value, but as you said, the key is to "make sure the curve is smooth and passes through or close to most of the spike-in points".

It is probably also worth checking the results of downstream analyses, like dimensionality reduction and clustering, with the different sets of HVGs; if the results make sense, then we don't need to worry about it.

Lastly, do you think it would be worthwhile to add a function to the scran package for automatically selecting the span value? This is just a tentative idea.

Thanks

LTLA commented 5 years ago

Select a span... how, exactly? Cross-validation can do crazy things when you don't have a lot of points, as my plots above demonstrate. The default span (of 0.3, IIRC) might not be optimal in any particular case, but it is robust to many scenarios and you can rest assured that it will do an okay-ish job.

Do you have a specific problem with the defaults that requires an automatically selected span? Because if you don't have a problem, just leave it as it is and don't worry about it. If you're worried about the choice of span being arbitrary, well, you'll probably get different results depending on which cross-validation scheme you choose, as I've shown above (GCV vs. CV, GCV vs. AICc)!

In short, if it's going to be arbitrary anyway, might as well go with the simplest approach that works.
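To illustrate how much the scheme alone matters, here is a toy example in base R (the simulated data are made up for illustration; they just mimic many low-abundance and few high-abundance spike-ins):

```r
# Toy data: many points at low x, few at high x, as with the spike-ins here.
set.seed(42)
x <- c(runif(60, 0, 4), runif(10, 4, 10))
y <- exp(-x) + rnorm(70, sd=0.05)

# The same smoother, differing only in the cross-validation scheme.
fit.gcv <- smooth.spline(x, y, cv=FALSE)  # generalized cross-validation
fit.cv <- smooth.spline(x, y, cv=TRUE)    # ordinary leave-one-out CV

# The two schemes generally settle on different effective degrees of
# freedom, i.e., different amounts of smoothing.
c(GCV=fit.gcv$df, LOOCV=fit.cv$df)
```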

FTD2018 commented 5 years ago

Thanks. So the best approach is to understand the characteristics of your scRNA-seq data well, and then choose a suitable value. I got it.

I have only just started learning scRNA-seq data analysis, so....

Anyway, thanks! Xiaopeng

LTLA commented 5 years ago

Well, your understanding doesn't have to be deep. Just make sure the curve isn't lumpy (span too small) and doesn't miss a lot of spike-in points (span too large). In other words: if it doesn't look obviously wrong, don't worry about it. There are much more difficult steps in scRNA-seq data analysis (e.g., annotation and interpretation of clusters and trajectories) that make this seem like a trifling matter.

FTD2018 commented 5 years ago

Thanks for your advice. This has been a pleasant conversation. Thanks!