andreyshabalin / MatrixEQTL

Matrix eQTL: Ultra fast eQTL analysis via large matrix operations
53 stars 16 forks source link

Understanding "Colinear or zero covariates detected" error message #23

Closed RegnerM2015 closed 1 year ago

RegnerM2015 commented 1 year ago

Hi @andreyshabalin,

Thank you for an amazing resource!

I am troubleshooting what I think is an environment-related issue when running Matrix_eQTL_Main(). In one environment (generic set of R libraries on an HPC cluster), the code runs as expected without errors, but when I run the same code and data in a different environment (conda or singularity container), I receive: "Colinear or zero covariates detected"

I believe this happens here under Covariates processing:

https://github.com/andreyshabalin/MatrixEQTL/blob/7e5c1cb7a7a0c2afe4a7efd697e0de7b99b44729/R/Matrix_eQTL_engine.R#L1502-L1527

My .Machine$double.eps is the same across environments and it looks like the qr functions are referencing the same namespace.

I am not sure where to go or what to try next so if you could provide additional insights or suggestions, that would be greatly appreciated!

andreyshabalin commented 1 year ago

Can you share with me the set of covariates you are using?

RegnerM2015 commented 1 year ago

I have an RDS file for the SlicedData object of the covariates. How would you like me to share this file with you?

For additional context, I am trying to use modelLINEAR_CROSS to test an interaction between conditions (disease v. normal). Additionally, I included dummy variables for patient to account for patient specific effects.

andreyshabalin commented 1 year ago

Let continue via email: andrey.shabalin@gmail.com Please send me your covariates.

andreyshabalin commented 1 year ago

Your last covariate (groupBy) is exactly the sum of the first 8 covariates. This is a case of collinearity which you should avoid.

Also, your first 12 covariates add up to 1 (a constant). This is another case of collinearity which you should avoid.

Andrey

RegnerM2015 commented 1 year ago

Thanks Andrey.

For additional context, I am attempting to implement Matrix EQTL in single-cell genomics data (matched scRNA-seq and scATAC-seq) where we can correlate the expression of genes with chromatin accessibility of certain genomic regions (aka peaks). Therefore, in my use case, the "snps" variable is replaced with chromatin accessibility denoted as "peak".

To clarify, it sounds like my desired model (gene ~ peak + condition + peak*condition + patient) is not feasible with my current experimental design. My groupBy covariate denotes which observations belong to the disease condition. To my understanding, I cannot avoid collinearity here since the patient dummy variables are not distributed across both conditions (disease v. normal). In other words, all of the observations denoted as normal are from normal patients while all all observations denoted as disease are from disease patients. If I had disease observations and normal observations from the same patient(s), this would help rectify the collinearity (for example: tumor tissue and adjacent normal tissue collected from the same patient). The problem would persist even if I dropped one of the dummy variables to serve as the "base reference level". Do you concur with this reasoning?

In summary, I would like to ask if you think there are workarounds for this situation and/or if my desired model is ultimately not theoretically possible?

andreyshabalin commented 1 year ago

Hi Matt,

First, I hope we can agree that Matrix eQTL indicated collinearity correctly.

Second, your research question is more complicated than it seems. I think it is possible to do the analysis you are after. But it should be done using linear mixed model, not OLS.

My best advice would be to find a statistician at your department to consult regarding this model.

Andrey

RegnerM2015 commented 1 year ago

Yes, I agree Matrix eQTL indicated collinearity correctly. However, I am still not sure why this collinearity error occurs in one computational environment but not another (with the same code and data). After our discussion, I now realize that this discrepancy is probably not as important.

I was also brainstorming a linear mixed effects model implementation, but initially I preferred MatrixEQTL for speed and scalability purposes.

Are you aware of any linear mixed effects model tools that operate similarly to Matrix_EQTL? Otherwise, I would probably use an slurm array job or another form of parallelization to test all peak-gene combinations (~200k).

Thanks for your help!