CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Convergence halted due to matrix inversion problems #847

Open rojinsafavi opened 4 years ago

rojinsafavi commented 4 years ago

I would like to use the lifelines package for Cox regression analysis on TCGA data (a very high dimensional dataset, ~1500 features), and I keep getting the error "Convergence halted due to matrix inversion problems". I was wondering if this package is suitable for such a high dimensional dataset?

CamDavidsonPilon commented 4 years ago

Hi @rojinsafavi,

> I was wondering if this package is suitable for such a high dimensional dataset

Maybe. I haven't tested lifelines with 1.5k features, and internally it needs to compute a 1.5k x 1.5k variance matrix, which can be a lot for your computer's memory.

How many observations do you have? Does the problem persist if you increase the penalizer term?
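To illustrate why increasing the penalizer can help, here is a minimal sketch using only NumPy. It is a linear-algebra analogy, not lifelines' internal code: the matrix `X` and the value `0.1` are made up, and `X.T @ X` stands in for the matrix the optimizer has to invert. In lifelines itself the corresponding knob is the `penalizer` argument of `CoxPHFitter`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: 50 observations, 200 features (more features than rows).
X = rng.normal(size=(50, 200))

# With more columns than rows, X^T X has rank at most 50, so it is singular.
gram = X.T @ X
rank_unpenalized = np.linalg.matrix_rank(gram)

# A ridge-style penalty adds a positive constant to the diagonal,
# which lifts every eigenvalue away from zero and makes the matrix invertible.
penalizer = 0.1  # illustrative value, not a recommendation
gram_penalized = gram + penalizer * np.eye(X.shape[1])
rank_penalized = np.linalg.matrix_rank(gram_penalized)

print("rank without penalty:", rank_unpenalized, "of", gram.shape[0])
print("rank with penalty:   ", rank_penalized, "of", gram.shape[0])
```

If the number of observations is smaller than the number of features, as in this toy setup, some form of penalization or feature reduction is generally needed before the inversion can succeed at all.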

CamDavidsonPilon commented 4 years ago

(Like I said, I've never tested a dataset with that many features. If you are able to share it with me privately, that would be appreciated, as it would help extend lifelines to support high dimensional data better.)

ahlusar1989 commented 4 years ago

@rojinsafavi I can share my own experience working with such a high dimensional dataset. The error that you received is actually very informative. To briefly summarize my own experience (directly quoting some of the helpful comments provided by @CamDavidsonPilon):

The reason my gradients were exploding, or convergence was never achieved, was high collinearity in the dataset. A common cause of this error is dummying categorical variables but not dropping a column, or some hierarchical structure in your dataset.
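The dummy-variable case can be checked directly. The sketch below uses a small made-up frame (the `age` and `stage` columns are hypothetical) and centers the columns before computing the rank, on the assumption that a constant shift in the covariates carries no information in a Cox model, so a set of dummies that sums to one becomes an exact linear dependency.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric covariate and one categorical covariate.
df = pd.DataFrame({
    "age": [61, 45, 70, 52, 38, 66],
    "stage": ["I", "II", "III", "I", "II", "III"],
})

def centered_rank(frame):
    """Return (rank, number of columns) of the column-centered design matrix."""
    X = frame.astype(float).to_numpy()
    X = X - X.mean(axis=0)  # remove the constant direction
    return int(np.linalg.matrix_rank(X)), X.shape[1]

# Keeping every dummy level: the three stage columns always sum to 1.
full = pd.get_dummies(df, columns=["stage"])
rank_full, ncols_full = centered_rank(full)

# Dropping one level removes the redundancy.
dropped = pd.get_dummies(df, columns=["stage"], drop_first=True)
rank_dropped, ncols_dropped = centered_rank(dropped)

print("all levels:    rank", rank_full, "with", ncols_full, "columns")
print("drop_first:    rank", rank_dropped, "with", ncols_dropped, "columns")
```

When the rank is smaller than the number of columns, at least one column is a linear combination of the others, which is the situation that leads to the inversion failure.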

I verified this by "finding the relationship by adding a penalized term to the model, or using the variance inflation factor (VIF) to find redundant variables."
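For the VIF route, here is a minimal sketch that computes the factors with NumPy only, using the identity that the VIFs are the diagonal of the inverse correlation matrix. The helper name `variance_inflation_factors` and the simulated columns are my own, for illustration; a threshold such as 10 is a common rule of thumb, not something lifelines prescribes.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(frame):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(frame.astype(float).to_numpy(), rowvar=False)
    # Note: with exactly collinear columns this inverse fails or blows up,
    # which is itself a sign of redundancy.
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns)

# Simulated data: "a_plus_b" is almost an exact sum of "a" and "b".
rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
frame = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.05, size=n),
    "independent": rng.normal(size=n),
})

vifs = variance_inflation_factors(frame)
print(vifs.sort_values(ascending=False))
```

Columns with very large values are candidates for removal or merging; with ~1500 features the same computation applies, but the correlation matrix may be too ill-conditioned to invert until the most redundant columns have been dropped.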

It will take some work, but you can exhaustively check subsets of the matrix's columns to ensure that the design matrix has full column rank, and therefore that the matrix being inverted actually has an inverse. In addition, I would exercise judgement, based on domain knowledge, to find the relevant covariates.
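An exhaustive search over column subsets gets expensive quickly, so as a cheaper stand-in here is a greedy sketch: walk the columns in order and keep one only if it raises the rank of the columns kept so far. This is a swapped-in heuristic, not the procedure from the comment above or anything built into lifelines, and the function name and toy matrix are hypothetical. The result depends on column order, so domain knowledge can be encoded by putting the most important covariates first.

```python
import numpy as np

def independent_columns(X, tol=None):
    """Greedily select indices of columns that are linearly independent."""
    kept = []
    for j in range(X.shape[1]):
        candidate = kept + [j]
        # Keep column j only if it adds a new direction to the kept set.
        if np.linalg.matrix_rank(X[:, candidate], tol=tol) == len(candidate):
            kept.append(j)
    return kept

# Toy matrix in which column 2 equals column 0 + column 1.
X = np.array([
    [1.0, 2.0, 3.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [2.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 2.0, 3.0],
])

kept = independent_columns(X)
print("columns kept:", kept)
X_reduced = X[:, kept]
```

Each step recomputes a rank, so for ~1500 features this takes a while but remains feasible; it only catches exact (numerical) dependencies, so near-collinear columns still need a penalizer or a VIF check.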

I also implemented a similar heuristic as described in the paper here. I hope this may be of some use.