Closed nomennominatur closed 1 year ago
Hi there, I can't see why it wouldn't show plots if you have show_plots=True. My first guess is that there aren't any columns to evaluate. What does the following produce?
print(cph.params_.index)
print(cph.params_.index)
gives:
Index(['educ'], dtype='object', name='covariate')
The model seems to work on the covariate, since running cph on the same dataset
cph.fit(survival_educ, 'TIMEDTH', 'DEATH')
cph.print_summary(model="untransformed variables", decimals=3)
gives:
<lifelines.CoxPHFitter: fitted with 4321 total observations, 2812 right-censored observations>
             duration col = 'TIMEDTH'
                event col = 'DEATH'
      baseline estimation = breslow
   number of observations = 4321
number of events observed = 1509
   partial log-likelihood = -12297.619
         time fit was run = 2023-04-11 07:12:17 UTC
                    model = untransformed variables

             coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate
educ       -0.217      0.805     0.027          -0.270          -0.164                0.763                0.849

           cmp to  z  p  -log2(p)

Concordance = 0.562
Partial AIC = 24597.237
log-likelihood ratio test = 67.620 on 1 df
-log2(p) of ll-ratio test = 52.164
The plots are still missing, also when I try various different covariates (i.e. numerical, categorical, etc.).
Don't know if this helps; the data looks like this:

      DEATH  TIMEDTH  educ
0         0     8766     4
1         0     8766     2
2         0     8766     1
3         1     2956     3
4         0     8766     3
...     ...      ...   ...
4316      1     6433     2
4317      1     6729     1
4318      0     8766     2
4319      0     8766     3
4320      0     8766     3
As stated in a previous post, check_assumptions behaves the same when run with a dichotomous covariate (0/1).
Oh I think I know. Try this:
cph.check_assumptions(survival_educ, p_value_threshold=0.00, advice=True, show_plots=True)
This is non-obvious, but it skips plots if the p-value for the column is too low.
Unfortunately, with your advice above, I still get the same output
Proportional hazard assumption looks okay.
and no graph.
Please compare check_assumptions on the two structurally similar datasets shown below. As far as I can see, the only difference is that (A) fulfills, and (B) violates, the PH assumption.
survival_dicho= pd.read_csv("frmgham2_me_cohort-1_survival_educ_dichotomous.csv", sep=';')
print(survival_dicho)
      DEATH  TIMEDTH  higher
0         0     8766       1
1         0     8766       1
2         0     8766       0
3         1     2956       1
4         0     8766       1
...     ...      ...     ...
4316      1     6433       1
4317      1     6729       0
4318      0     8766       1
4319      0     8766       1
4320      0     8766       1
survival_data = pd.read_csv("regimes_events_export_all.csv", sep=';')
print(survival_data)
      duration  observed  dem
0            7         1    0
1           10         1    0
2           10         1    0
3            5         0    0
4            1         0    0
...        ...       ...  ...
1803         6         1    0
1804         1         0    0
1805        14         1    0
1806         1         1    0
1807        29         0    0
It is pure chance that the printout above shows only '0' in dem; the whole dataset contains 1186 '1' and 621 '0' values.
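That distribution is easy to confirm with value_counts(), which reports all levels even when a truncated print happens to display only one. A minimal sketch on stand-in data (the real CSV is not attached here, so the values below are invented):

```python
import pandas as pd

# Hypothetical stand-in for survival_data; values are invented.
survival_data = pd.DataFrame({
    "duration": [7, 10, 10, 5, 1, 6],
    "observed": [1, 1, 1, 0, 0, 1],
    "dem":      [0, 1, 1, 0, 1, 1],
})

# value_counts shows the full distribution of the covariate, regardless of
# which rows the abbreviated DataFrame print happens to show.
print(survival_data["dem"].value_counts())
```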
cph.check_assumptions(survival_dicho, p_value_threshold=0.05, advice=True, show_plots=True)
Proportional hazard assumption looks okay.
(As detailed above, the result is the same with p_value_threshold=0.00.)
cph.check_assumptions(survival_data, p_value_threshold=0.05, show_plots=True)
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
check_assumptions problem reproducible with dummy datasets

@CamDavidsonPilon: Please excuse me if this post seems overly verbose! Since I am new to lifelines, I want to give you the opportunity to discriminate between user fault and unexpected behaviour of check_assumptions.
Thanks for bearing with me!
I have prepared two dummy datasets, and I will visualize and test them consecutively. When the hazards test as proportional, unfortunately no plots are displayed, despite show_plots=True being specified:
# Import libraries / methods
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
from lifelines import CoxPHFitter
cph = CoxPHFitter()
%matplotlib inline
# Import CSV datasets
dummy_a = pd.read_csv("dummy_a.csv", sep=';')
print(dummy_a)
dummy_b = pd.read_csv("dummy_b.csv", sep=';')
print(dummy_b)
TIME_a EVENT_a COVARIATE_a
0 1 1 0
1 2 1 0
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 0
6 7 1 0
7 8 1 0
8 9 1 0
9 10 1 0
10 11 1 0
11 12 1 0
12 13 1 0
13 14 1 0
14 15 1 0
15 16 1 0
16 17 1 0
17 18 1 0
18 19 1 0
19 20 1 0
20 2 1 1
21 4 1 1
22 6 1 1
23 8 1 1
24 10 1 1
25 12 1 1
26 14 1 1
27 16 1 1
28 18 1 1
29 20 1 1
30 20 0 1
31 20 0 1
32 20 0 1
33 20 0 1
34 20 0 1
35 20 0 1
36 20 0 1
37 20 0 1
38 20 0 1
39 20 0 1
TIME_b EVENT_b COVARIATE_b
0 1 1 0
1 1 1 0
2 2 1 0
3 2 1 0
4 3 1 0
5 3 1 0
6 3 1 0
7 4 1 0
8 4 1 0
9 4 1 0
10 6 1 0
11 8 1 0
12 10 1 0
13 12 1 0
14 16 1 0
15 18 1 0
16 20 0 0
17 20 0 0
18 20 0 0
19 20 0 0
20 10 1 1
21 15 1 1
22 16 1 1
23 16 1 1
24 17 1 1
25 17 1 1
26 17 1 1
27 17 1 1
28 18 1 1
29 18 1 1
30 18 1 1
31 18 1 1
32 18 1 1
33 18 1 1
34 19 1 1
35 19 1 1
36 19 1 1
37 19 1 1
38 19 1 1
39 20 0 1
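In case anyone wants to reproduce this without the CSV files, both frames can be rebuilt directly from the printouts above (a sketch with values transcribed from the listings, not the original files):

```python
import pandas as pd

# dummy_a: covariate 0 has events at t = 1..20; covariate 1 has events at
# t = 2, 4, ..., 20 plus ten observations censored at t = 20.
dummy_a = pd.DataFrame({
    "TIME_a":      list(range(1, 21)) + list(range(2, 21, 2)) + [20] * 10,
    "EVENT_a":     [1] * 30 + [0] * 10,
    "COVARIATE_a": [0] * 20 + [1] * 20,
})

# dummy_b: covariate 0 fails early (16 events, 4 censored at t = 20);
# covariate 1 fails late (19 events clustered near t = 17-19, 1 censored).
dummy_b = pd.DataFrame({
    "TIME_b": [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 6, 8, 10, 12, 16, 18, 20, 20, 20, 20]
              + [10, 15, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 20],
    "EVENT_b": [1] * 16 + [0] * 4 + [1] * 19 + [0],
    "COVARIATE_b": [0] * 20 + [1] * 20,
})

print(len(dummy_a), dummy_a["EVENT_a"].sum())  # 40 rows, 30 events
print(len(dummy_b), dummy_b["EVENT_b"].sum())  # 40 rows, 35 events
```

The row and event counts match the summaries below (40 observations each, 30 vs. 35 events observed).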
T_a = dummy_a["TIME_a"]
E_a = dummy_a["EVENT_a"]
T_b = dummy_b["TIME_b"]
E_b = dummy_b["EVENT_b"]
# Kaplan-Meier
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(9,6))
fig.suptitle("Visualize Dummy Datasets")
# dummy_a:
value_range = [0, 1]
for value in value_range:
    disc = (dummy_a["COVARIATE_a"] == value)
    kmf.fit(T_a[disc], event_observed=E_a[disc], label=value)
    kmf.plot_survival_function(ax=ax1, ci_show=False)
ax1.set_title("dummy_a", color="#4C9900")
ax1.set(xlabel='(Time)')
plt.tight_layout()
# dummy_b:
value_range = [0, 1]
for value in value_range:
    disc = (dummy_b["COVARIATE_b"] == value)
    kmf.fit(T_b[disc], event_observed=E_b[disc], label=value)
    kmf.plot_survival_function(ax=ax2, ci_show=False)
ax2.set_title("dummy_b", color="#4C9900")
ax2.set(xlabel='(Time)')
plt.tight_layout()
cph.fit(dummy_a, 'TIME_a', 'EVENT_a')
cph.print_summary(model="untransformed variables", decimals=3)
cph.check_assumptions(dummy_a, p_value_threshold=0.05, show_plots=True)
model | lifelines.CoxPHFitter |
---|---|
duration col | 'TIME_a' |
event col | 'EVENT_a' |
baseline estimation | breslow |
number of observations | 40 |
number of events observed | 30 |
partial log-likelihood | -88.854 |
time fit was run | 2023-04-17 09:55:01 UTC |
model | untransformed variables |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|---|
COVARIATE_a | -1.399 | 0.247 | 0.411 | -2.204 | -0.594 | 0.110 | 0.552 | 0.000 | -3.407 | 0.001 | 10.572 |
Concordance | 0.645 |
---|---|
Partial AIC | 179.708 |
log-likelihood ratio test | 12.724 on 1 df |
-log2(p) of ll-ratio test | 11.436 |
Proportional hazard assumption looks okay.
cph.fit(dummy_b, 'TIME_b', 'EVENT_b')
cph.print_summary(model="untransformed variables", decimals=3)
cph.check_assumptions(dummy_b, p_value_threshold=0.05, show_plots=True)
model | lifelines.CoxPHFitter |
---|---|
duration col | 'TIME_b' |
event col | 'EVENT_b' |
baseline estimation | breslow |
number of observations | 40 |
number of events observed | 35 |
partial log-likelihood | -104.852 |
time fit was run | 2023-04-17 09:55:01 UTC |
model | untransformed variables |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | cmp to | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|---|
COVARIATE_b | -0.409 | 0.664 | 0.347 | -1.090 | 0.272 | 0.336 | 1.313 | 0.000 | -1.177 | 0.239 | 2.065 |
Concordance | 0.648 |
---|---|
Partial AIC | 211.704 |
log-likelihood ratio test | 1.362 on 1 df |
-log2(p) of ll-ratio test | 2.040 |
The ``p_value_threshold`` is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.
With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using ``check_assumptions(..., show_plots=True)``
and looking for non-constant lines. See link [A] below for a full example.
null_distribution | chi squared |
---|---|
degrees_of_freedom | 1 |
model | <lifelines.CoxPHFitter: fitted with 40 total o... |
test_name | proportional_hazard_test |
| | test_statistic | p | -log2(p) |
---|---|---|---|---|
COVARIATE_b | km | 22.05 | <0.005 | 18.52 |
rank | 23.25 | <0.005 | 19.43 |
@CamDavidsonPilon: The plotting code section for check_assumptions in mixins.py could be moved so that it plots Schoenfeld residuals whenever show_plots=True, regardless of the numeric test result.
Proposal (lines 95-136 of the file in my repo): https://github.com/nomennominatur/check_assumptions_forceplot/blob/main/mixins.py
We would gain:
I feel too much of a novice to open a pull request into your repository (sorry). However, I would appreciate it if you would consider my proposal.
I have worked up my courage and made my first-ever pull request: #1518
Discussed in https://github.com/CamDavidsonPilon/lifelines/discussions/1513