Can variancePartition handle TPM normalized counts as the input?

evaesquinas commented 2 years ago

Hello Dr. Gabriel Hoffman,

In this issue you said that the variancePartition software deals with log2 CPM rather than modeling the counts directly and the produced residuals are log2 CPM corrected for covariates.

Checking the vignette and the section "inputs" it appears that the gene matrix that you have to use as your input has to be normalized and counts can be normalized using counts per million (CPM), reads per kilobase per million (RPKM) or fragments per kilobase per million (FPKM).

I am using log2(TPM counts + 1) as my input, is it a problem for the function? Can variancePartition software handle TPM counts? Because maybe I am getting incorrect results and I do not know.

Look forward to your reply,

Thanks very much in advance

Regards

GabrielHoffman commented 2 years ago

Variance partitioning works on any of those inputs (log2 CPM, TMP, RPKM, FPKM). But the voom() method to estimate precision weights is based on having raw counts and estimating the empirical mean-variance trend. Then voom()/voomWithDreamWeights() produces log2 CPM, and the corresponding weights.

You can use other quantification methods, but you need to skip the voom step.

GabrielHoffman commented 2 years ago

Just to clarify, voom()/voomWithDreamWeights() directly produces log2 CPM values used on downstream analysis:

library(variancePartition)
library(edgeR)

data(varPartDEdata)

# normalize RNA-seq counts
dge = DGEList(counts = countMatrix)
dge = calcNormFactors(dge)

# specify formula with random effect for Individual
form <- ~ Disease 

# compute observation weights
vobj = voomWithDreamWeights( dge[1:20,], form, metadata)

# log2 CPM
vobj$E[1:3, 1:3]
#>                              sample_01 sample_02 sample_03
#> ENST00000570099.1 gene=YPEL3  1.626165  1.714455  1.210451
#> ENST00000589123.1 gene=NFIC   5.273260  5.584833  6.035297
#> ENST00000360314.3 gene=CASS4  4.277136  5.050188  4.492905

# precision weights
vobj$weights[1:3, 1:3]
#>          [,1]     [,2]     [,3]
#> [1,] 4.656509 4.752344 4.750688
#> [2,] 3.833944 3.055607 3.080704
#> [3,] 4.952305 6.050839 6.111942

evaesquinas commented 2 years ago

But what if you are not using voom? If you are using the "standard application" your get CPM values too?

For example:

library('variancePartition')
# load simulated data:
# geneExpr: matrix of gene expression values
# info: information/metadata about each sample
data(varPartData)
form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch)
browser()
varPart <- fitExtractVarPartModel( geneExpr, form, info )

Does varPart contain log2 CPM values? Because you answered here that residuals are log2 CPM corrected for covariates, so... the values from the covariates (batch, individual, tissue, age) are log2CPM corrected too?

head(varPart)
             Batch Individual     Tissue          Age  Residuals
gene1 1.579468e-04  0.8903741 0.02468833 4.528902e-05 0.08473430
gene2 0.000000e+00  0.8060388 0.10101116 3.336694e-04 0.09261633
gene3 2.422457e-03  0.8901125 0.03561715 1.471721e-03 0.07037622
gene4 0.000000e+00  0.7688275 0.12531360 1.014416e-03 0.10484444
gene5 5.418753e-11  0.6997313 0.20909304 3.871505e-05 0.09113695
gene6 2.347128e-03  0.7222111 0.16787306 2.717284e-03 0.10485138

Thanks again for your help.

GabrielHoffman commented 2 years ago

The function fitExtractVarPartModel() returns the fraction of variance explained by each variable in the formula, and the remaining variation. These are fractions, so you get the fraction of residual variance for each gene.

If you want the full residuals per gene and sample (i.e. not the residual fraction), you can run this code to get the log2 CPM removing variation explained by variables in the formula.

library('variancePartition')
data(varPartData)
form <- ~ Age + (1|Individual) + (1|Tissue) + (1|Batch)
fit <- dream( geneExpr, form, info )

res = residuals(fit)

evaesquinas commented 2 years ago

Okay, now I understand everything better because I was a bit confused with the concepts. Thank you so much for your help and time.

GabrielHoffman / variancePartition

Can variancePartition handle TPM normalized counts as the input? #62