CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License
2.38k stars 560 forks source link

Cox regression analysis #502

Closed nigelvahey closed 6 years ago

nigelvahey commented 6 years ago

Hi @CamDavidsonPilon,

A colleague of mine produced the attached SPSS cox regression output involving an interaction between a continuous covariate and a three-level categorical covariate OUTPUT.pdf

I have been trying various configurations of dummy variables with lifelines to no avail -- in particular, I haven't been able to generate a model that includes both the contrast interactions and the main categorical effect/interaction without generating the following problem:

"LinAlgError: Singular matrix: This means that there is a linear combination in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. Try to find the relationship by looking at the correlation matrix of your dataset."

As far as you are aware, is there any way to get lifelines to reproduce the attached analysis output?

Best regards,

Nigel

CamDavidsonPilon commented 6 years ago

Hey Nigel, can you share how you are preprocessing the data frame? Specifically, how are you creating the dummy variables and interactions? Sharing python code would help too

nigelvahey commented 6 years ago

No problem -- thanks for getting back to me so quickly @CamDavidsonPilon!

I'm using the following code on a pandas dataframe with the 'Condition_new' categorical variable having three levels (1=SmartQuit,2=Contingency Management, 3=TAU).

pss_survival['SmartQuit']=np.where(pss_survival['Condition_new']==1,1,0)
pss_survival['Contingency Management']=np.where(pss_survival['Condition_new']==2,1,0)
pss_survival['TAU']=np.where(pss_survival['Condition_new']==3,0,0)

#Contrast interactions using 'TAU' as reference condition for contrast
pss_survival['SmartQuit_TAU_PSS_interaction']=pss_survival.loc[:,'SmartQuit'].multiply(pss_survival.loc[:,'PSS_TOTAL'])
pss_survival['CM_TAU_PSS_interaction']=pss_survival.loc[:,'Contingency Management'].multiply(pss_survival.loc[:,'PSS_TOTAL'])

cph = CoxPHFitter()
cph.fit(pss_survival, duration_col='Days_to_relapse_excluding_losses_to_follow_up', event_col='30_day_relapsed', show_progress=True)

cph.print_summary() 
Results=cph.summary

cph.plot_covariate_groups('Condition_new', [1,2,3])
plt.xlabel('Time to Relapse in Days')
plt.ylabel('Proportion Surviving')
plt.title('Survival function of by Treatment Condition')
plt.yticks([0,.25,.5,.75,1.0],['0%','25%','50%','75%','100%'])

AND this is the error code it generates:

Traceback (most recent call last):

  File "<ipython-input-115-5cedc4fa0d69>", line 2, in <module>
    cph.fit(pss_survival, duration_col='Days_to_relapse_excluding_losses_to_follow_up', event_col='30_day_relapsed', show_progress=True)

  File "C:\Users\nigel\Anaconda2\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 149, in fit
    step_size=step_size)

  File "C:\Users\nigel\Anaconda2\lib\site-packages\lifelines\fitters\coxph_fitter.py", line 237, in _newton_rhaphson
    delta = spsolve(-h, step_size * g.T, sym_pos=True)

  File "C:\Users\nigel\Anaconda2\lib\site-packages\scipy\linalg\basic.py", line 251, in solve
    _solve_check(n, info)

  File "C:\Users\nigel\Anaconda2\lib\site-packages\scipy\linalg\basic.py", line 31, in _solve_check
    raise LinAlgError('Matrix is singular.')

LinAlgError: Matrix is singular.
CamDavidsonPilon commented 6 years ago

How many values can Condition_new take on? Do you need to drop one of them as a "baseline" variable? Ex: https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding

nigelvahey commented 6 years ago

It takes three values @CamDavidsonPilon and I'd like to contrast against the third as a baseline reference to contrast other two conditions against -- but even when I exclude that third dummy variable TAU (i.e. all zeros) from the dataframe I still get a singlularity unless I also delete 'Condition_New' itself, and this then prevents me from considering the main effect of 'Condition_New' in my covariate plot or otherwise. Honestly, I'm not very confident that I understand the dummy coding that is automatically implemented in the background by SPSS but in any case I was wondering whether it might be possible to achieve something similar to the SPSS output attached above but using Lifelines

CamDavidsonPilon commented 6 years ago

but even when I exclude that third dummy variable TAU (i.e. all zeros)

mm, how do you exclude it? You set the entire column to 0? You need to drop the column entirely from the analysis.

Note the following perfect collinearity:

Condition_New = 1* SmartQuit + 2*"Contingency Management" + 3* TAU

So yes, you do need to drop Condition_New.

Looking at the output of the SPSS analysis, they drop Condition_new(3) - so that is TAU in your example.

nigelvahey commented 6 years ago

Thanks for sticking with me on this @CamDavidsonPilon

I tried with various different dummy codings in an effort to reproduce the relevant SPSS statistical output using the same data using trial and error -- crucially, one of those attempts involved completely dropping the column TAU entirely from the dataframe being passed to cph.fit

Unfortunately, even when I do completely drop the 'TAU' column my cph.fit still results in a singularity -- presumably because it only takes two dummy variables to fully define a three level categorical variable (i.e. the third dummy variable is essentially 'turned off' being composed entirely of zeros)....thus somehow resulting in collinearity within cph.fit.

My key problem: I just don't know how to to make Lifeline cph.fit retain contrasts using the two non-TAU dummy coded variables while also retaining the main 'Condition_new' variable (or even better how to instruct cph.fit to automatically process 'Condition_new' hierarchically in terms of itself and the two subsidiary contrasts of interest) -- somehow the SPSS software is able to provide a main effect of 'Condition_new' AND also the two aforementioned contrasts within 'Condition_new' (while also dealing with the resulting interactions between these categorical effects and the continuous variable PSS_TOTAL....I think these latter interactions would be easy to produce if I could just figure out how to retain both 'Condition_new' and its two subsidiary contrasts of interest)

CamDavidsonPilon commented 6 years ago

Are you able to share the dataset with me? You can privately over cam.davidson.pilon@gmail.com

nigelvahey commented 6 years ago

Thanks @CamDavidsonPilon -- I'll do that now

CamDavidsonPilon commented 6 years ago

So I got pretty close to the Stata results. One thing that differs though, is that Stata drops 3 rows, and I can't find which ones (it suggest three rows with missing data, but I don't see them in the dataset you sent me). Here's a script to reproduce what Stata is doing:

df = pd.read_excel("/Users/camerondavidson-pilon/Downloads/pss_survival_sample.xlsx")

# transform data
df_ = pd.get_dummies(df, columns=['Condition_new'])
del df_['Condition_new_3']
df_['Condition_new_1 * PSS_TOTAL'] = df_['Condition_new_1'] * df_['PSS_TOTAL']
df_['Condition_new_2 * PSS_TOTAL'] = df_['Condition_new_2'] * df_['PSS_TOTAL']
cols = ['Condition_new_2 * PSS_TOTAL', 'Condition_new_1 * PSS_TOTAL', 'Condition_new_1', 'Condition_new_2', 'Days_to_relapse_excluding_losses_to_follow_up', '30_day_relapsed']

# lifelines
from lifelines import CoxPHFitter
cp = CoxPHFitter()
cp.fit(df_[cols], 'Days_to_relapse_excluding_losses_to_follow_up', '30_day_relapsed')
cp.print_summary()

The observed differences are due to the three extra rows in my dataset, I believe.