`proportional_hazard_test` tests don't match R well

CamDavidsonPilon commented 4 years ago

Possible solution: https://github.com/CamDavidsonPilon/lifelines/issues/997#issuecomment-652567848

aysrivastava94 commented 4 years ago

I checked. Proportional_hazard_test results (test statistic and p value) are same irrespective of which transform I use. Perhaps there is some accidentally hard coding of this in the backend?

CamDavidsonPilon commented 4 years ago

hm, that behaviour sounds strange, but must be data specific. I've been looking into this function recently, and have seen difference between transforms. I'll investigate further however.

aysrivastava94 commented 4 years ago

I can upload my codes if needed. Let me know

CamDavidsonPilon commented 4 years ago

That would be appreciated! It would be nice to understand the behaviour more.

aysrivastava94 commented 4 years ago

Here you go (Link to the R results I attempted to mimic: http://www.sthda.com/english/wiki/cox-model-assumptions)

# Importing packages
import lifelines
import pandas as pd
from lifelines import datasets

# Importing dataset
lung_dataset = datasets.load_lung()

# Cox PH
lung_coxPH = lifelines.CoxPHFitter().fit(duration_col = 'time', event_col = 'status', df = lung_dataset.drop(['inst', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal'],axis = 1).dropna())

# Proportional Hazard Testing
# KM transform
lifelines.statistics.proportional_hazard_test(lung_coxPH, lung_dataset.drop(['inst', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal'], axis = 1).dropna(), transform = 'km').print_summary()

# Rank transform
lifelines.statistics.proportional_hazard_test(lung_coxPH, lung_dataset.drop(['inst', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal'], axis = 1).dropna(), transform = 'rank').print_summary()

# Result obtained in both case
# time_transform | rank
# null_distribution | chi squared
# degrees_of_freedom | 1
# transform | km
# test_name | proportional_hazard_test

#         test_statistic      p
# age              2.35 0.13
# sex              4.86 0.03
# wt.loss             0.76  0.38

taoxu2016 commented 4 years ago

The events col in lung_dataset is "1" for censored and "2" for dead. With your code, all the events would be True. Please include below line in your code:

lung_dataset["status"] = lung_dataset["status"] - 1

# Results obtained with the above update
#              test_statistic     p
# age            0.67            0.41
# sex            2.17            0.14
# wt.loss    0.03            0.86

Still not exactly the same as the results from R.

CamDavidsonPilon commented 4 years ago

@taoxu2016 is correct, and another change needs to be made:

transform should read time_transform:

lifelines.statistics.proportional_hazard_test(lung_coxPH, lung_dataset.drop(['inst', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal'], axis = 1).dropna(), time_transform = 'km').print_summary()

CamDavidsonPilon commented 4 years ago

So I dug deeply into this problem.

In version 3.0 of survival, released 2019-11-06, a new, more accurate version of the cox.zph was introduced. This avoided an assumption of variance matrices do not varying much over time.

This also explains why when I wrote this function for lifelines (late 2018), all my tests that compared lifelines with R were working fine, but now are giving me trouble.

I have no plans at this time to update this function to use the more accurate version. My attitudes towards the PH assumption have changed in the meantime.

MetzgerSK commented 3 years ago

A follow-up on this: I was cross-referencing R's **old** cox.zph calculations (< survival 3, before the routine was updated in 2019) with check_assumptions()'s output, using the rossi example from lifelines' documentation...

...and I'm finding the output doesn't match. I used Stata (which still uses the PH test approximation) to verify that nothing odd was occurring with survival::cox.zph's calculations.

Summary of Relevant Output

Cox Model Output

:heavy_check_mark: R and Python match.
:heavy_check_mark: R and Stata match.

// R (trimmed)
    Call:
    coxph(formula = Surv(week, arrest) ~ age + fin + mar + paro,
    data = py$rossi, ties = "breslow")

    n= 432, number of events= 114

             coef exp(coef) se(coef)      z Pr(>|z|)
    age  -0.06656   0.93560  0.02104 -3.163  0.00156 **
    fin  -0.34626   0.70733  0.19017 -1.821  0.06863 .
    mar  -0.49298   0.61080  0.37431 -1.317  0.18783
    paro -0.17357   0.84065  0.19253 -0.902  0.36729
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

// Python (trimmed)
                coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%
    covariate
    age       -0.067      0.935      0.021           -0.108           -0.025
    fin       -0.346      0.707      0.190           -0.719            0.026
    mar       -0.494      0.610      0.374           -1.227            0.240
    paro      -0.174      0.840      0.193           -0.551            0.204

// Stata
    Cox regression with Breslow method for ties

    No. of subjects =    432                                Number of obs =    432
    No. of failures =    114
    Time at risk    = 19,809
                                                            LR chi2(4)    =  21.24
    Log likelihood = -665.06224                             Prob > chi2   = 0.0003

    ------------------------------------------------------------------------------
              _t | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             age |  -.0665648   .0210428    -3.16   0.002    -.1078079   -.0253217
             fin |  -.3462649   .1901694    -1.82   0.069    -.7189901    .0264602
             mar |  -.4929785   .3743093    -1.32   0.188    -1.226611    .2406542
            paro |  -.1735749   .1925273    -0.90   0.367    -.5509215    .2037717
    ------------------------------------------------------------------------------

PH Test Output

:x: R and Python don't match.
:heavy_check_mark: R and Stata match.

// R
    # A tibble: 8 x 4
      covar tFF        chisq       p
      <chr> <chr>      <dbl>   <dbl>
    1 age   km    7.07       0.00782
    2 age   rank  6.91       0.00857
    3 fin   km    0.00000363 0.998
    4 fin   rank  0.000932   0.976
    5 mar   km    2.20       0.138
    6 mar   rank  2.14       0.143
    7 paro  km    0.0473     0.828
    8 paro  rank  0.0355     0.851

// Python
               test_statistic    p  -log2(p)
    age  km              6.79 0.01      6.77
         rank            7.21 0.01      7.11
    fin  km              0.00 0.95      0.07
         rank            0.00 0.96      0.06
    mar  km              2.09 0.15      2.75
         rank            2.28 0.13      2.93
    paro km              0.03 0.87      0.20
         rank            0.04 0.85      0.24

// Stata
    Test of proportional-hazards assumption

    Time function: 1 - Kaplan-Meier estimate
    --------------------------------------------------------
                 |        rho     chi2       df    Prob>chi2
    -------------+------------------------------------------
             age |   -0.21256     7.07        1       0.0078
             fin |   -0.00018     0.00        1       0.9985
             mar |    0.13497     2.20        1       0.1377
            paro |   -0.02002     0.05        1       0.8279
    -------------+------------------------------------------
     Global test |                8.12        4       0.0874
    --------------------------------------------------------

    Test of proportional-hazards assumption

    Time function: Rank of analysis time
    --------------------------------------------------------
                 |        rho     chi2       df    Prob>chi2
    -------------+------------------------------------------
             age |   -0.21008     6.91        1       0.0086
             fin |    0.00283     0.00        1       0.9756
             mar |    0.13304     2.14        1       0.1434
            paro |   -0.01735     0.04        1       0.8505
    -------------+------------------------------------------
     Global test |                7.92        4       0.0946
    --------------------------------------------------------

MWEs

Python

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

rossi = load_rossi()

cph = CoxPHFitter()

# Simplified model spec from readme ex (fewer covars, no strata var)
cph.fit(rossi, 'week', event_col='arrest', formula="age + fin + mar + paro")
cph.print_summary(model="untransformed variables", decimals=3)
cph.check_assumptions(rossi, advice=False, p_value_threshold=0.05)

R

## IMPORTANT: installed survival version must be <= 2.44-1
library(dplyr)
library(tibble)
library(reticulate)
#*******************************************************************************
# MUST be survival 2.44-1 or earlier for the approximation of T&G's score test
tempDir <- "C:/Users/Public/Documents/" # temporary directory for survival 2.44.1 install
remotes::install_version("survival", version = "2.44-1", lib = tempDir)
library(survival, lib.loc = tempDir)

# Load data via Python
repl_python()
    from lifelines.datasets import load_rossi
    rossi = load_rossi()
exit

# Estm model ----
mod <- coxph(Surv(week, arrest) ~ age + fin + mar + paro,
             data=py$rossi, ties="breslow")

summary(mod) # this spec's output now matches Python

# PH test ----
tFFs <- c("km", "rank")

## Approx ====
zph.appr <- lapply(tFFs,
                   function(x){
                        cox.zph(mod, transform=x)$table %>%
                        data.frame %>%
                        rownames_to_column(var = "covar") %>%
                        mutate(tFF = x,
                               dispOrd = row_number(),
                               tFFOrd = which(tFFs==x))
                   })
## (Output in a way that mirrors Python, for sanity's sake)
zph.appr <- do.call(bind_rows, zph.appr) %>%
                arrange(dispOrd, tFFOrd) %>%
                select(covar, tFF, chisq, p) %>%
                filter(covar!="GLOBAL") %>%
                tibble

## Print ====
zph.appr

Stata

python
from sfi import *
from lifelines.datasets import load_rossi
rossi = load_rossi()

# Throw back into Stata
Data.setObsTotal(len(rossi))
varlist = ["week", "arrest", "age", "fin", "mar", "paro"]
for x in varlist:
    Data.addVarInt(x)
    Data.store(x, None, rossi[x], None)
end

// Estm model
stset week, fail(arrest)
stcox age fin mar paro, nohr breslow

// PH test
estat phtest, detail km
estat phtest, detail rank

CamDavidsonPilon commented 3 years ago

Hi @MetzgerSK - thanks for the (very) detailed report. I'll look into this soon. Apologies that this is occurring. I'm relieved that a previous-me did write tests for this function, but that was on a different dataset. I'll review why rossi dataset is different, building off what you've shown here.

CamDavidsonPilon commented 3 years ago

I haven't yet dug into this, but my suspicion is that the results are due to how ties are handled. rossi has lots of ties, whereas the testing dataset I used has none.

MetzgerSK commented 3 years ago

Possibly. I did quickly check the (unscaled) Schoenfelds out of lifelines' compute_residuals() and survival 2.44-1's resid() for the rossi data, using the models from my original MWE.

Original rossi (ties): The Schoenfeld values are close for a given t-covariate profile pairing, but R and Python don't match.
Toy rossi (no ties + only one censored data point at t = 52, the largest duration in the toy data): R's and Python's Schoenfelds match.

Not definitive, but it's suggestive.

Toy `rossi`

from lifelines.datasets import load_rossi
rossi = load_rossi()
rossi_dup = rossi.sort_values(by=['week', 'age', 'race', 'wexp', 'mar', 'paro', 'prio'])  
   # ^ quick attempt to get unique sort order
rossi_dup.drop_duplicates(subset = "week", keep = 'first', inplace = True)

CamDavidsonPilon / lifelines