Which kind of input matrix is better to estimate the number of surrogate variables?

Hello,

I am trying your package and have a question about estimating n.sv. I see you recommend calculating n.sv on Y.r, the matrix of residuals after regressing out the known confounding factors, using the EstDimRMT function from ISVA, and then applying smartsva to the raw data matrix Y.

In the ISVA documentation, EstDimRMT is applied to the raw matrix, without regressing out known confounding factors.

When I try SmartSVA on well-annotated local RNA expression data, the estimated number of surrogate variables grows almost linearly with the number of known confounding factors I add. This confuses me, because I would expect fewer estimated surrogate variables as I remove the effect of more and more known confounding factors.
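To make this concrete, here is a minimal sketch on simulated data (not my real data) of one thing that does change as covariates are added: the residual matrix loses one degree of freedom per regressed column. The covariates `covars` and the helper `resid_rank` are hypothetical names I made up for illustration, and I am not claiming this is what drives the EstDimRMT result.

```r
# Simulated noise only: 1000 features by 20 samples, as in the package example.
set.seed(1)
n_samples <- 20
Y <- matrix(rnorm(1000 * n_samples), 1000, n_samples)

# Hypothetical known covariates (random numbers, purely for illustration).
covars <- matrix(rnorm(n_samples * 5), n_samples, 5)

# Regress out the first k covariates (plus intercept) and return the
# numerical rank of the residual matrix Y.r.
resid_rank <- function(k) {
  X <- covars[, seq_len(k), drop = FALSE]
  Y.r <- t(resid(lm(t(Y) ~ X)))
  d <- svd(Y.r, nu = 0, nv = 0)$d
  sum(d > 1e-8 * max(d))
}

ranks <- sapply(1:5, resid_rank)
print(ranks)
```

With k covariates and an intercept, the rank of Y.r is at most n_samples - k - 1, so the matrix passed to the estimator is increasingly rank-deficient even though it keeps the same dimensions.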

Could you elaborate on why the estimation should be applied to the residual matrix after regressing out the known factors? Why is this better than applying the estimation to the raw data matrix, as done in SVA and ISVA?

Thanks for your help,
Mattia

Here is the snippet from SmartSVA I am referring to:

library(SmartSVA)
library(isva)  # provides EstDimRMT

## Methylation M values (CpG by Sample)
Y <- matrix(rnorm(20*1000), 1000, 20)
df <- data.frame(pred=gl(2, 10))
## Determine the number of SVs on the residuals (known factor regressed out)
Y.r <- t(resid(lm(t(Y) ~ pred, data=df)))
## Add one to compensate potential loss of 1 degree of freedom
##  in confounded scenarios (FALSE turns off the plot)
n.sv <- EstDimRMT(Y.r, FALSE)$dim + 1
## Run smartsva on the raw matrix Y with the full model
mod <- model.matrix( ~ pred, df)
sv.obj <- smartsva(Y, mod, mod0=NULL, n.sv=n.sv)
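
For comparison, this is a rough base-R sketch of the two alternatives I am asking about, running the same dimension estimate on the raw matrix and on the residual matrix. The helper `mp_count` is hypothetical and is NOT isva's EstDimRMT: it is a simplified stand-in that counts eigenvalues of the sample correlation matrix above the Marchenko-Pastur upper edge, which is only my assumption about the general idea behind a random-matrix-theory estimate. The data are simulated noise.

```r
set.seed(2)
Y <- matrix(rnorm(1000 * 20), 1000, 20)
df <- data.frame(pred = gl(2, 10))

# Residual matrix, built the same way as in the SmartSVA snippet.
Y.r <- t(resid(lm(t(Y) ~ pred, data = df)))

# Hypothetical helper (not isva::EstDimRMT): number of eigenvalues of the
# sample-by-sample correlation matrix above the Marchenko-Pastur upper edge.
mp_count <- function(M) {
  evals <- eigen(cor(M), symmetric = TRUE, only.values = TRUE)$values
  ratio <- ncol(M) / nrow(M)          # samples / features
  upper_edge <- (1 + sqrt(ratio))^2   # upper edge for unit-variance noise
  sum(evals > upper_edge)
}

counts <- c(raw = mp_count(Y), residual = mp_count(Y.r))
print(counts)
```

Swapping `mp_count` for `EstDimRMT(M, FALSE)$dim` and adding more columns to `df` would reproduce the comparison I describe above.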
