CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Convergence halted due to high collinearity #1148

Closed: akaph2p closed this issue 4 years ago

akaph2p commented 4 years ago

Hi,

I'm using the NASA turbofan dataset for remaining-useful-life prediction with CoxPHFitter. I get an error stating 'Convergence is halted due to matrix inversion problems'. The dataset contains readings from 25 different sensors on one type of machine, so correlation between columns is expected. I'd like to keep all of that information, because every reading gives a different perspective on the operation of the machine.

I tried using a penalizer with CoxPHFitter, but the error still didn't go away. It is possible I used it incorrectly. Do I need an array of penalizer values, one for each column in my dataset?


fandata.zip

CamDavidsonPilon commented 4 years ago

Hi @akaph2p,

1) Is there a constant column in your dataset? If so, drop it.
2) What is the ratio of events to non-events in your dataset?
3) If you try a really, really large penalizer, does it still fail?
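
The three checks above can be sketched roughly as follows. This is only an illustration: the toy DataFrame, the column names "T", "E", "sensor_1", "sensor_2", and the helper functions are hypothetical stand-ins, not taken from the attached data. The last helper passes a single scalar penalizer to CoxPHFitter and needs lifelines installed; the value 100.0 is an arbitrary "large" choice.

```python
import pandas as pd


def constant_columns(df, exclude=()):
    """Return names of columns with at most one distinct non-null value."""
    return [
        col for col in df.columns
        if col not in exclude and df[col].nunique(dropna=True) <= 1
    ]


def event_ratio(df, event_col):
    """Return (events, non_events) counts for a 0/1 event indicator column."""
    events = int(df[event_col].astype(bool).sum())
    return events, len(df) - events


# Hypothetical toy frame standing in for the turbofan data; the column
# names "T" (duration), "E" (event flag), and "sensor_*" are assumptions.
df = pd.DataFrame({
    "T": [5, 8, 3, 9, 6, 7],
    "E": [1, 0, 1, 1, 0, 1],
    "sensor_1": [0.2, 0.5, 0.1, 0.9, 0.4, 0.7],
    "sensor_2": [2.0] * 6,  # a sensor whose reading never changes
})

# Check 1: find and drop constant covariates.
to_drop = constant_columns(df, exclude=("T", "E"))
cleaned = df.drop(columns=to_drop)

# Check 2: count events versus non-events (censored rows).
n_events, n_censored = event_ratio(df, "E")


def fit_with_penalizer(frame, penalizer=100.0):
    """Check 3: refit with one large scalar penalizer (requires lifelines)."""
    from lifelines import CoxPHFitter  # imported lazily; unused by checks 1 and 2
    return CoxPHFitter(penalizer=penalizer).fit(
        frame, duration_col="T", event_col="E"
    )
```

On the toy frame, `to_drop` contains only the constant sensor column, and `cleaned` is what you would pass to the fitter.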

akaph2p commented 4 years ago

Hi @CamDavidsonPilon,

There were a few constant columns in my dataset. After dropping them, the convergence error did go away. Thank you for that suggestion.

I'd like to know more about why that solves my issue. Could you please elaborate on that?

CamDavidsonPilon commented 4 years ago

Check out the literature on multicollinearity: https://stats.stackexchange.com/questions/70899/what-correlation-makes-a-matrix-singular-and-what-are-implications-of-singularit and https://stats.stackexchange.com/questions/1149/is-there-an-intuitive-explanation-why-multicollinearity-is-a-problem-in-linear-r
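
For the Cox model specifically, a constant covariate multiplies every subject's hazard by the same factor, and that factor cancels in the partial likelihood. The data therefore carry no information about its coefficient, and the matching row and column of the information matrix (the negative Hessian) are zero, so the matrix cannot be inverted. The sketch below shows this with NumPy on hypothetical toy data. It is a hand-written illustration that assumes no tied event times, not lifelines' actual implementation; `cox_information` and all variable names are made up for the example.

```python
import numpy as np


def cox_information(X, T, E, beta):
    """Negative Hessian of the Cox partial log-likelihood (no tied times)."""
    info = np.zeros((X.shape[1], X.shape[1]))
    w = np.exp(X @ beta)  # relative hazards exp(x . beta)
    for i in np.flatnonzero(E):  # loop over observed events only
        at_risk = T >= T[i]  # subjects still at risk at this event time
        wr, Xr = w[at_risk], X[at_risk]
        mean = (wr @ Xr) / wr.sum()  # weighted covariate mean in the risk set
        second = (Xr * wr[:, None]).T @ Xr / wr.sum()  # weighted second moment
        info += second - np.outer(mean, mean)  # weighted covariance
    return info


# Hypothetical toy data: durations, event flags, and two covariates.
T = np.array([5.0, 8.0, 3.0, 9.0, 6.0, 7.0])
E = np.array([1, 0, 1, 1, 0, 1])
sensor_1 = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.7])
sensor_2 = np.full(6, 2.0)  # constant column

# With the constant column: its row and column are zero, so rank < 2.
X_bad = np.column_stack([sensor_1, sensor_2])
info_bad = cox_information(X_bad, T, E, beta=np.zeros(2))
rank_bad = np.linalg.matrix_rank(info_bad, tol=1e-8)

# Without it: the 1x1 information matrix is positive and invertible.
X_ok = sensor_1[:, None]
info_ok = cox_information(X_ok, T, E, beta=np.zeros(1))
rank_ok = np.linalg.matrix_rank(info_ok, tol=1e-8)
```

Two columns that are exact linear combinations of each other produce the same rank deficiency, which is why highly correlated sensor columns can trigger a similar failure.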