edvanburen / twosigma


gene_error=TRUE/"Model Fit Error" interpretation #3

Closed bazelep closed 2 years ago

bazelep commented 2 years ago

Dear Authors,

I don't have a bug to report, but I am wondering how to interpret the per-gene "Model Fit Error" / gene_error = TRUE event. Thank you for any insight.

Regards,
Peter Bazeley

edvanburen commented 2 years ago

Hello Peter,

Thanks for your message. When a gene error is returned, there are a few possible explanations (which can operate separately or together):

(1) A zero-inflated model may not be a good assumption for the particular gene being fit. For example, total gene expression or the proportion of cells with zero expression may be too high or too low. As an extreme example: if no zero counts are observed for a gene and one tries to fit a zero-inflated model, the estimated zero-inflation intercept will approach negative infinity and estimation of the covariance matrix can break down.

(2) If the observed counts are not overdispersed (or, more unusually in this context, are underdispersed), the estimate of the overdispersion parameter phi from the negative binomial will run off to infinity. I have encountered this for genes with very low expression, and it can lead to an undefined covariance matrix. (A quick per-gene check for this point and point (1) is sketched after this list.)

(3) If you are fitting a model with random effect terms, there may not be sufficient data to predict them accurately. This could be due to a limited number of individuals (e.g., fewer than three donors).
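
For points (1) and (2), a quick per-gene diagnostic can flag genes where these assumptions are likely to fail before any model is fit. This is only a minimal sketch in base R, assuming `counts` is a genes-by-cells matrix of raw counts (the object name is illustrative, not part of the package):

```r
# counts: genes-by-cells matrix of raw counts (illustrative name)

# Point (1): proportion of zero counts per gene; values near 0 or 1
# make a zero-inflated model hard to fit.
zero_prop <- rowMeans(counts == 0)

# Point (2): crude overdispersion check; a negative binomial expects
# variance > mean, so genes with variance <= mean may push the
# dispersion estimate toward its boundary.
gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

flagged <- data.frame(zero_prop     = zero_prop,
                      mean          = gene_mean,
                      var           = gene_var,
                      overdispersed = gene_var > gene_mean)
head(flagged[order(flagged$zero_prop, decreasing = TRUE), ])
```

Genes with a zero proportion near 0 or 1, or with variance at or below the mean, are the usual suspects for a gene error.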

Whichever of the above is at play, if you are seeing errors for many genes it may be worth filtering genes using some arbitrary threshold so that only genes with a "somewhat reasonable" amount of expression are kept (e.g. the kinds of rules used by the Seurat package, or just keeping the top X genes by total expression). Statistically this must be done with care, but power will be very low for genes with very sparse expression anyway.
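
To make the filtering idea concrete, here is a minimal sketch of one arbitrary rule in base R; the `min_cells` cutoff of 10 and `top_x` of 2000 are placeholders, not recommendations from the package:

```r
# Keep genes detected (count > 0) in at least min_cells cells; the cutoff
# is arbitrary and should be tuned to your data.
min_cells <- 10
keep <- rowSums(counts > 0) >= min_cells
counts_filtered <- counts[keep, , drop = FALSE]

# Alternative: keep the top_x genes by total expression.
top_x  <- 2000
totals <- rowSums(counts)
ord    <- order(totals, decreasing = TRUE)
counts_top <- counts[ord[seq_len(min(top_x, nrow(counts)))], , drop = FALSE]
```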

If filtering out genes with sparse expression does not help, another idea is to reduce the model complexity and see if fitting improves. I would start by removing the random effect terms, if included, and then the zero-inflation component, again if included, to see whether inference improves. We showed that omitting needed random effect terms can have a large impact on type I error, so I recommend including them if at all possible. Unfortunately, some datasets do not lend themselves to random effect terms if, for example, there are relatively few individuals or few cells per individual.
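
As an illustration of that simplification ladder, below is a hedged sketch using glmmTMB directly, which fits the same family of zero-inflated negative binomial mixed models; the twosigma interface has its own arguments for toggling these components, so treat this only as an analogue. `y`, `group`, `cdr`, `donor`, and `dat` are hypothetical names:

```r
library(glmmTMB)

# dat: one row per cell for a single gene, with hypothetical columns
#   y (counts), group and cdr (covariates), donor (individual ID).

# Full model: zero-inflated negative binomial with donor random intercepts
# in both the mean and zero-inflation components.
fit_full <- glmmTMB(y ~ group + cdr + (1 | donor),
                    ziformula = ~ group + (1 | donor),
                    family = nbinom2, data = dat)

# Step 1: drop the random effect terms.
fit_no_re <- glmmTMB(y ~ group + cdr,
                     ziformula = ~ group,
                     family = nbinom2, data = dat)

# Step 2: also drop the zero-inflation component (ziformula = ~0).
fit_nb <- glmmTMB(y ~ group + cdr,
                  ziformula = ~ 0,
                  family = nbinom2, data = dat)
```

If the plain negative binomial converges while the fuller models do not, that localizes the problem to the zero-inflation or random effect components.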

Does this help?

Eric

bazelep commented 2 years ago

Hi Eric,

Thank you, this is very helpful.

Regards, Peter