GabrielHoffman / variancePartition

Quantify and interpret divers of variation in multilevel gene expression experiments
http://gabrielhoffman.github.io/variancePartition/
57 stars 14 forks source link

Error in if (inherits(possibleError, "error") && grep("the fixed-effects model matrix is column rank deficient", : missing value where TRUE/FALSE needed #22

Closed rachadele closed 4 years ago

rachadele commented 4 years ago

I keep getting this error when I try to add a specific variable to my formula for the fitExtractVarPartModel function. The variable raising the error are GEO accession numbers representing samples. I converted them from characters to factors. Also, for some reason I get a warning when trying to model age of sample participants as a continuous variable as in the example script.

Here is my formula function:

form <- ~(1|geo_accession)+(1|source_name_ch1)+(1|characteristics_ch1.1) +(1|characteristics_ch1.2)+(1|characteristics_ch1.4)+(1|characteristics_ch1.5)+(1|characteristics_ch1.6)

advice?

GabrielHoffman commented 4 years ago

What error are you getting and can you give a minimal reproducible example?

rachadele commented 4 years ago

The error I'm getting is the title of the issue:

Error in if (inherits(possibleError, "error") && grep("the fixed-effects model matrix is column rank deficient", : missing value where TRUE/FALSE needed

I'm getting it when I run the following command:

varPart <- fitExtractVarPartModel( ExpSet, form, info )

where ExpSet is an expression set extracted from a GSE file, form is the formula I specified above, and info is a data frame extracted from the expression set whose columns are the formula variables. I converted all the variables' data types to factor within 'info' as well as 'ExpSet'. I hope this helps! If necessary I can also give you the actual data I'm working with.

Thanks!

GabrielHoffman commented 4 years ago

You mentioned that the problem arises when the "GEO accessions" are added to the model. In order to meaningfully contribute to the model, a variable must have at least 2 observations for most levels. If "GEO accessions" are unique to each sample, that would explain your error. If there is only one sample per level, there is no variation within levels, so the model is degenerate and reports that the model is " rank deficient".

In such a case, the biological interpretation of this variable is also problematic.

rachadele commented 4 years ago

Okay, thank you! I am trying to apply the generic variance partition pipeline to an expression set using individual, age, batch, and several other phenotypic characteristics as my variables. I thought I could use GEO accession number to represent the individual variable but I suppose that's not the correct way to do it.

When using variance partition with GEO Expression Set objects, is there a specific data frame I should be looking for to extract all of the info I need? So far I have been using a data frame within the expression set where the rows are all of the sample IDs (GEO accession numbers) and the columns are the phenotypic variables I want to use for my model. Is this the correct approach to running variancePartition on an expression set, or am I misunderstanding something?

Thanks again!

rachadele commented 4 years ago

Also, there is a column with the sample IDs which I was trying to use to represent the individual donor's identity.

GabrielHoffman commented 4 years ago

Can you point me to a specific example?

rachadele commented 4 years ago

gset <- getGEO("GSE56035", GSEMatrix =TRUE, getGPL=FALSE) expSet <- gset[["GSE56035_series_matrix.txt.gz"]] info <- gset@phenoData@data

GabrielHoffman commented 4 years ago

Here it takes a little parsing to extract the Individual

library(GEOquery)
library(variancePartition)
library(BiocParallel)

register(SerialParam())

gset <- getGEO("GSE56035", GSEMatrix =TRUE, getGPL=FALSE) 
expSet <- gset[["GSE56035_series_matrix.txt.gz"]]

obj = phenoData(gset$GSE56035_series_matrix.txt.gz)

info = pData(obj)

info$Individual = gsub("\\..*", "", info$title)

form <- ~(1|Individual)+(1|source_name_ch1)+(1|characteristics_ch1.1) +(1|characteristics_ch1.2)+(1|characteristics_ch1.4)+(1|characteristics_ch1.5)+(1|characteristics_ch1.6)

# This will work, but expression is on raw scale instead of log scale
# vp = fitExtractVarPartModel( expSet, form, info )

# Convert gene expression to log scale
# I don't know why some ray values are negative, so set them to a small value
geneExpr = exprs(expSet)
geneExpr[geneExpr<0] = 1e-4
geneExpr = log(geneExpr)

vp = fitExtractVarPartModel( geneExpr[1:100,], form, info )

plotVarPart(vp)
rachadele commented 4 years ago

Thank you! This helped a lot.

I also receive the following error message:

Error in checkModelStatus(fitInit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff) : Categorical variables modeled as fixed effect: Age The results will not behave as expected and may be very wrong!!

where:

info$Age <- gsub("age |age|\: |\(yrs|\)", "", info$characteristics_ch1)

from what I understand from the model data set age is supposed to be modeled as a fixed effect since it's continuous?

GabrielHoffman commented 4 years ago

Yes, but you need to convert it from a string to ‘numeric’

rachadele commented 4 years ago

Thank :-)

rachadele commented 4 years ago

I'm also getting the same column-rank error when I try to compare variances for individuals within the same cell type:

form <- ~ (Cell_type+0|Individual) + (1|Cell_type)

vp = fitExtractVarPartModel( geneExpr, form, info ) Error in if (inherits(possibleError, "error") && grep("the fixed-effects model matrix is column rank deficient", : missing value where TRUE/FALSE needed

GabrielHoffman commented 4 years ago

You have 2 issues here.

1) Specifying ~ (Cell_type+0|Individual) is is problematic because it analyzes variation across cell types within each individual. But in the dataset, there are some individuals that aren't observed in both cells types. This makes the model rank deficient, so you get that error.

To see this you can run xtabs to a get all combinations of the two variables. You can then only include individuals observed in both cell types.

tab = xtabs(~Cell_type + Individual, info)
idx = which(apply(tab, 2, min) > 0)
include = info$Individual %in% names(idx)

form <- ~ (Cell_type+0|Individual) 
vp = fitExtractVarPartModel( geneExpr[1:100,include], form, info[include,] )

2) The second problem is that once you resolve the first problem, you will get this second error.

Error in checkModelStatus(fitInit, showWarnings = showWarnings, colinearityCutoff = colinearityCutoff) : 
  Random slope models {i.e. ~ (var1 | var2) } are no longer supported
for estimating variance fractions.
They produced results that were not interpretable.

Since I wrote the code 5 years, it has become clearer that the interpretation of random slope models is problematic. The genes are still ranked correctly, but the variance fractions themselves are not interpretable because then don't sum to 100. Hence this error now.

If that is not enough to dissuade you, you can still run the model with

vp = fitExtractVarPartModel( geneExpr[1:100,include], form, info[include,], showWarnings=FALSE )