RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

warning messages: In rbind(...) : number of columns of result is not a multiple of vector length #143

Closed wangjingshen closed 3 years ago

wangjingshen commented 3 years ago

Dear MAST team

I'm using MAST mixed model to deal with individual effect. I filter on genes with at least 10% expression, but get warnings, just like this, image I’m not sure how this came about and whether it has an impact on downstream. When I filter on genes with at least 20% expression the warnings go away. But,this may filiter many genes i want.

Thank you for any help you can provide!

gfinak commented 3 years ago

Please provide a minimum reproducible example. We can't help you if you don't provide code. We've no idea what your data look like or what you are actually doing.

wangjingshen commented 3 years ago

Thanks for your reply. Of course,here is my code: test<- zlm(~ group + factorA + factorB + factorC + factorD + (1|individual) , sca, method='glmer', ebayes=FALSE) And test_sca.rds is my test data filtered on genes with at least 10% expression.

test_sca.rds.zip

wangjingshen commented 3 years ago

And my package version:

MAST_1.14.0 
lme4_1.1-23  
data.table_1.12.8
gfinak commented 3 years ago

I can reproduce this warning but I only get one such warning. The data set is so large that it's not really practical for me to try and isolate the gene leading to this. If you can post a smaller set of data and pinpoint which gene gives this warning I can look into it, but I don't really have the time to chase it down at the moment. I suspect it's linked to genes where the model fails to converge or estimate certain coefficients, so I wouldn't worry about it in that sense, you're not getting information from those anyway.

wangjingshen commented 3 years ago

Thanks a lot. And I find the warning reported by 1720th gene(g12300). test_sca_zlm.zip

gfinak commented 3 years ago

The problem is that you cannot estimate two of the coefficients for gene g12300 in the continous model due to the design.

Browse[3]> object@fitC
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: response ~ 0 + `(Intercept)` + groupg_2 + factorA + factorBf_2 +      factorC + factorD + (1 | individual)
   Data: datpos
      AIC       BIC    logLik  deviance  df.resid 
 366.4424  385.3967 -177.2212  354.4424       168 
Random effects:
 Groups     Name        Std.Dev.
 individual (Intercept) 0.00    
 Residual               0.67    
Number of obs: 174, groups:  individual, 2
Fixed Effects:
`(Intercept)`       groupg_2        factorC        factorD  
      9.45195       -0.55502        1.41102       -0.09451  
fit warnings:
fixed-effect model matrix is rank deficient so dropping 2 columns / coefficients

Notice it drops factorBf and factorA.

So for factor Bf2

Browse[3]> table(factorbf2=object@modelMatrix[,"factorBf_2"],expressed=object@response>0)
         expressed
factorbf2 FALSE TRUE
        0  1061  174
        1   125    0

notice there are no cells of class factorBf_2 when the gene is expressed, so the continuous model can't estimate anything there for this gene.

Consequently it's combining coefficients across genes into a table but this does seem to be a bug. @amcdavid The code that does the collection of summaires in collectSummaries should probably be doing something like this:

do.call(rbind.fill,lapply(lapply(listOfSummaries,"[[", "coefC"),function(x)as.data.frame(t(x))))%>%head()

This uses rbind.fill from dplyr and ensures the column names are respected.

wangjingshen commented 3 years ago

I get it, thanks a lot.

gfinak commented 3 years ago

You can pull a fixed version from GitHub here.

amcdavid commented 3 years ago

Thanks for looking into this @gfinak, this actually should have been handled by the coef method which was intended to pad missing factor levels with NAs. I fixed it there (also could be an issue with method='bayesglm') and reverted back to the rbind.