Please post a minimal reproducible example that includes your current data structure and the model you are trying to fit, as well as which steps you have already tried and how they failed. This is not enough information for us to help you. (Please do read this before replying.)
I'm sorry the previous question was unclear.
I want to fit a Cox model with this data to get each variable's hazard ratio. The variables are
age, gender, apache_iv, charlson_score,
four_bin, furosemide, vaso_inotro, vaso_use_firstday and intubated_firstday,
and the variables four_bin, furosemide and vaso_inotro
may change over time.
library(survival)
cicu_cox <- coxph(Surv(tstart, tstop, status) ~ four_bin + age + gender + apache_iv +
                    charlson_score + furosemide + vaso_inotro +
                    vaso_use_firstday + intubated_firstday, data = cicu_7day)
cicu_coxz <- cox.zph(cicu_cox, transform = "log")
print(cicu_coxz) ## test the Cox PH assumption
I got the following test results:
rho chisq p
four_bin2 0.0239 1.09 0.296837
four_bin3 0.0821 12.87 0.000334
four_bin4 0.0773 11.51 0.000692
age 0.0794 11.22 0.000810
genderm -0.0241 1.11 0.291376
apache_iv 0.0650 8.23 0.004129
charlson_score 0.0310 1.88 0.170766
furosemide1 0.0456 4.09 0.043147
vaso_inotro1 -0.0810 14.20 0.000165
vaso_use_firstday1 -0.0453 4.48 0.034266
intubated_firstday1 0.0392 3.13 0.076828
GLOBAL NA 123.48 0.000000
Then I added term:time interactions:
tcicu_7day.t <- update(cicu_cox, . ~ . +
  model.matrix(~four_bin)[, c(-1, -3, -4)]:log(tstart + 1) +
  model.matrix(~four_bin)[, c(-1, -3, -2)]:log(tstart + 1) +
  model.matrix(~furosemide)[, c(-1)]:log(tstart + 1) +
  age:log(tstart + 1) + apache_iv:log(tstart + 1) +
  model.matrix(~vaso_inotro)[, c(-1)]:log(tstart + 1) +
  model.matrix(~intubated_firstday)[, c(-1)]:log(tstart + 1) +
  model.matrix(~vaso_use_firstday)[, c(-1)]:log(tstart + 1))
tcicu_coxz.ct <- cox.zph(tcicu_7day.t, transform = "log")
print(tcicu_coxz.ct)
I got the following results:
rho chisq p
four_bin2 0.01064 0.22119 0.638132
four_bin3 0.08077 12.38137 0.000434
four_bin4 0.00750 0.10649 0.744180
age -0.01375 0.33602 0.562138
genderm -0.02353 1.06006 0.303201
apache_iv 0.03522 2.28155 0.130921
charlson_score 0.03052 1.82046 0.177259
furosemide1 0.00180 0.00624 0.937028
vaso_inotro1 -0.04090 3.38903 0.065631
vaso_use_firstday1 0.02402 1.16885 0.279637
intubated_firstday1 -0.00118 0.00278 0.957965
model.matrix(~four_bin)[, c(-1, -3, -4)]:log(tstart + 1) 0.02222 0.94573 0.330808
log(tstart + 1):model.matrix(~furosemide)[, c(-1)] -0.00790 0.11732 0.731955
age:log(tstart + 1) 0.00192 0.00638 0.936349
apache_iv:log(tstart + 1) -0.08377 13.37122 0.000256
log(tstart + 1):model.matrix(~vaso_inotro)[, c(-1)] 0.07138 11.64070 0.000645
log(tstart + 1):model.matrix(~intubated_firstday)[, c(-1)] 0.01025 0.21272 0.644640
log(tstart + 1):model.matrix(~vaso_use_firstday)[, c(-1)] -0.03012 1.82853 0.176302
log(tstart + 1):model.matrix(~four_bin)[, c(-1, -3, -2)] 0.00488 0.04498 0.832033
GLOBAL NA 42.73653 0.001408
**There are still a lot of variables that don't meet the Cox proportional hazards assumption, and I don't know what to do next. Can pammtools handle this data and give me the hazard ratios I want? Could you please give me some suggestions about this problem?
Thanks again**
Hi, thanks for the effort to improve the question. It is clearer now, but without a reproducible example the amount of help we can offer is still limited.
That being said, here are some notes:
1) The way you set up your 2nd Cox model, the cox.zph output is not really meaningful, because it considers each term separately (i.e. four_bin3
and its time-varying component) when it should consider them together when calculating the Schoenfeld residuals/test statistic. I'm not sure whether this can be easily fixed. Also, the test often indicates a violation when the functional form of a (continuous) covariate effect is misspecified (e.g. linear vs. quadratic, etc.), so a low p-value does not necessarily mean that the effect is time-varying.
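One option (a sketch, not something suggested in the reply above) is to assess all of the added log(t) interaction terms jointly with a likelihood-ratio test between the nested Cox models, rather than reading the cox.zph rows one by one:
## joint LRT of the time-interaction terms (both models fit on the same data)
anova(cicu_cox, tcicu_7day.t)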
2) You already have the data in an appropriate format to fit a PAM/GAM. If you don't use as_ped
to transform the data, you won't be able to use all the post-processing functions directly; however, you can just fit the model and use, for example, predict.gam
to obtain predictions, etc. (see the recidivism example).
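A minimal sketch of that route, reusing the variable names from the question (the offset column and the exact formula are my assumptions, not a tested recommendation):
library(mgcv)
## piecewise-exponential Poisson GAM fit directly on the interval data;
## the offset log(interval length) turns the per-interval event indicator into a rate
cicu_7day$offset <- log(cicu_7day$tstop - cicu_7day$tstart)
pam_fit <- gam(status ~ s(tstop) + four_bin + age + gender + apache_iv +
                 charlson_score + furosemide + vaso_inotro +
                 vaso_use_firstday + intubated_firstday,
               data = cicu_7day, family = poisson(), offset = offset)
summary(pam_fit)                           # exp(coef) gives hazard ratios
head(predict(pam_fit, type = "response"))  # hazard rates per interval (an offset passed
                                           # as an argument is ignored by predict.gam)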
3) If you want to use as_ped
to transform the data, the easiest workaround is to bring the data into a format equivalent to pbc
and pbcseq
, i.e., something like this (the employed
variable is time-dependent):
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(pammtools)
#>
#> Attaching package: 'pammtools'
#> The following object is masked from 'package:stats':
#>
#> filter
#### This creates the data in the same format as you have
# raw data
# http://socserv.mcmaster.ca/jfox/Books/Companion/scripts/appendix-cox.R
recidivism <- read.table(
file = "http://math.unm.edu/~james/Rossi.txt",
header = TRUE) %>%
mutate(subject=row_number())
# transform into long format
recidivism_long <- recidivism %>%
gather(calendar.week, employed, emp1:emp52) %>%
filter(!is.na(employed)) %>% # employed unequal to NA only for intervals under risk
group_by(subject) %>%
mutate(
start = row_number()-1,
stop = row_number(),
arrest = ifelse(stop == last(stop) & arrest == 1, 1, 0),
offset = log(stop - start)) %>%
select(subject, start, stop, offset, arrest, employed, fin:educ) %>%
arrange(subject, stop)
#### From this data format, create a PED-format data set:
event_df <- recidivism_long %>%
group_by(subject) %>%
slice(n()) %>%
ungroup() %>%
select(-start)
tdc_df <- recidivism_long %>%
select(subject, start, employed) %>%
rename(tdc_time = "start")
ped <- as_ped(list(event_df, tdc_df), cut = c(unique(recidivism_long$stop)),
Surv(stop, arrest)~. | concurrent(employed, tz_var = "tdc_time"),
id = "subject")
### -> continue with analysis
Created on 2019-03-27 by the reprex package (v0.2.1)
4) For the actual analysis you will have to take some care to ensure identifiability. See section 4.2 here. Also, I posted a less complex example with time-varying effects and time-dependent covariates (using the pbc
data), including the calculation of survival probabilities for specific covariate histories, on SO.
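As an illustration of one common identifiability device (my sketch, not necessarily the exact construction from section 4.2): code the time-dependent factor as an ordered factor, so a constant main effect plus a smooth deviation over time is estimated for each non-reference level.
library(mgcv)
## time-varying effect of the time-dependent covariate via an ordered-factor by term
ped$employed_ord <- as.ordered(ped$employed)
pam_tve <- gam(ped_status ~ ti(tend) + employed_ord + ti(tend, by = employed_ord),
               data = ped, family = poisson(), offset = offset)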
@junmeiW was that of any help?
Sorry for the late reply. Thanks for your answer (it helped a lot). I'll have to trouble you again:
library(rms)
fit1 <- npsurv(Surv(tstart, tstop, status) ~ inotrope_use_firstday, data = mydata)
survplot(fit1, loglog = TRUE, logt = TRUE, conf = "none",
         label.curves = list(method = "arrow", cex = 1))
and I got results like the first figure for all variables except 'four_bin' (second figure, with crossed lines).
The first picture shows parallel lines.
Based on such a picture, can I assume that variables like these (first picture) fulfill the Cox PH assumption?
But in the cox.zph() test results, the p-values of variables like furosemide, vaso_inotro, vaso_use_firstday and inotrope_use_firstday
are less than 0.01:
rho chisq p
four_bin2 0.02789 1.484 2.23e-01
four_bin3 0.08163 12.692 3.67e-04
four_bin4 0.06786 8.869 2.90e-03
age 0.08305 12.557 3.95e-04
genderm -0.01785 0.604 4.37e-01
apache_iv 0.05747 6.019 1.42e-02
charlson_score 0.01152 0.264 6.08e-01
furosemide1 0.05935 6.977 8.26e-03
vaso_inotro1 0.07478 7.899 4.95e-03
vaso_use_firstday1 -0.18378 61.163 5.22e-15
inotrope_use_firstday1 -0.21321 76.649 0.00e+00
intubated_firstday1 0.09715 20.281 6.69e-06
GLOBAL NA 269.200 0.00e+00
Hmm... As you pointed out, the cox.zph output is not really meaningful because it considers each term separately, so I wonder how I should test the Cox model. Can the 'survplot' code above work for that?
I also found that the p-value from the cox.zph test differs from the PAM results,
e.g. for the variable apache_iv:
rho chisq p
other variables ... ... ....
apache_iv 0.03522 2.28155 0.130921
and the GAM code:
library(mgcv)
cicu_pam <- gam(status ~ s(tstop) # + s(sqrt(age))
                + charlson_score + gender + s(apache_iv), data = cicu_7day,
                family = poisson(), offset = offset)
plot(cicu_pam, select = 2, se = TRUE, ylim = c(-2, 2))
The plot shows a non-linear effect of apache_iv, which is inconsistent with the cox.zph test.
Now I feel stuck and I don't know how to proceed. Sorry to trouble you.
I have attached a file (part of my data) to this message [the original file is too large]: test.xlsx
Yeah, personally I'm not a big fan of cox.zph
. You can just fit the covariates with potential time-variation and interpret the estimated effects.
Also, modelling time-dependent covariates as time-varying effects might not be appropriate, at least not the way it is done below. I think I would just include the time-dependent covariates as normal covariates, without time-variation.
Below is an example, but be careful: in the test data there are barely any events after the first couple of intervals. The last one is at day 52, so modelling beyond 52 days doesn't make sense, and be careful with interpretation, as effects might depend on a few events in specific intervals. Obviously you need to check for plausibility, etc. Especially the effect shape of charlson_score
looks somewhat suspicious:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(pammtools)
#>
#> Attaching package: 'pammtools'
#> The following object is masked from 'package:stats':
#>
#> filter
library(mgcv)
#> Loading required package: nlme
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
#> This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
library(ggplot2)
theme_set(theme_bw())
Set1 <- RColorBrewer::brewer.pal(9, "Set1")
httr::GET(
"https://github.com/adibender/pammtools/files/3021131/test.xlsx",
httr::write_disk(tf <- tempfile(fileext = ".xlsx")))
#> Response [https://github-production-repository-file-5c1aeb.s3.amazonaws.com/106259608/3021131?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190329%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190329T122523Z&X-Amz-Expires=300&X-Amz-Signature=9b8bd9269ce11ccef30ac86628a9d107d3f2409eb928d37978c1943d1d91e48a&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3Bfilename%3Dtest.xlsx&response-content-type=application%2Fvnd.openxmlformats-officedocument.spreadsheetml.sheet]
#> Date: 2019-03-29 12:25
#> Status: 200
#> Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
#> Size: 2.93 MB
#> <ON DISK> /tmp/Rtmp4wAlc0/file3db6221c0dea.xlsx
test <- readxl::read_excel(tf)
#> New names:
#> * `` -> `..1`
test[, 1] <- NULL
test <- test %>%
mutate(
four_bin = as.factor(four_bin),
gender = as.factor(gender))
#### From this data format, create a PED-format data set:
event_df <- test %>%
group_by(icustay_id) %>%
slice(n()) %>%
ungroup() %>%
select(-tstart)
tdc_df <- test %>%
select(icustay_id, tstart, four_bin, furosemide, vaso_inotro)
ped <- as_ped(list(event_df, tdc_df), Surv(tstop, status)~. |
concurrent(four_bin, furosemide, vaso_inotro, tz_var = "tstart"),
id = "icustay_id")
mod <- bam(
formula = ped_status ~
ti(tend) +
gender + vaso_use_firstday + intubated_firstday +
four_bin + ti(tend, by = as.ordered(four_bin)) +
furosemide + ti(tend, by = as.ordered(furosemide)) +
vaso_inotro + ti(tend, by = as.ordered(vaso_inotro)) +
te(age, tend) +
te(tend, apache_iv) +
te(tend, charlson_score),
data = ped,
family = poisson(),
offset = offset,
method = "fREML",
control = list(trace = TRUE))
#> Setup complete. Calling fit
#> Deviance = 7194.75642633487 Iterations - 1
#> Deviance = 2917.65258895727 Iterations - 2
#> Deviance = 1275.75480118066 Iterations - 3
#> Deviance = 708.565720874756 Iterations - 4
#> Deviance = 524.307913551784 Iterations - 5
#> Deviance = 465.541077988574 Iterations - 6
#> Deviance = 443.630088833445 Iterations - 7
#> Deviance = 430.672513554005 Iterations - 8
#> Deviance = 435.094045071483 Iterations - 8
#> Deviance = 427.543065387718 Iterations - 9
#> Deviance = 430.494521906121 Iterations - 9
#> Deviance = 426.039625904847 Iterations - 10
#> Deviance = 427.929511622135 Iterations - 10
#> Deviance = 425.322282531327 Iterations - 11
#> Deviance = 424.782101203684 Iterations - 12
#> Deviance = 424.67482837376 Iterations - 13
#> Deviance = 424.780356105867 Iterations - 13
#> Deviance = 424.67439208804 Iterations - 14
#> Deviance = 424.72641192035 Iterations - 14
#> Deviance = 424.661290844853 Iterations - 15
#> Deviance = 424.693296434331 Iterations - 15
#> Deviance = 424.653588172151 Iterations - 16
#> Deviance = 424.641879365913 Iterations - 17
#> Deviance = 424.647624008846 Iterations - 17
#> Deviance = 424.639953885741 Iterations - 18
#> Deviance = 424.637542914791 Iterations - 19
#> Deviance = 424.63677341798 Iterations - 20
#> Deviance = 424.637149560176 Iterations - 20
#> Deviance = 424.636643524464 Iterations - 21
#> Deviance = 424.63648254924 Iterations - 22
#> Deviance = 424.636561121195 Iterations - 22
#> Deviance = 424.636455327709 Iterations - 23
#> Deviance = 424.636507062734 Iterations - 23
#> Deviance = 424.636437693936 Iterations - 24
#> Deviance = 424.636415834985 Iterations - 25
#> Fit complete. Finishing gam object.
#> user.self sys.self elapsed
#> initial 0.545 0.001 0.545
#> gam.setup 0.470 0.000 0.470
#> pre-fit 0.000 0.000 0.000
#> fit 35.007 0.004 35.011
#> finalise 0.486 0.000 0.486
## 1d smooths
smooths1d <- tidy_smooth(mod)
ggplot(smooths1d, aes(x = x, y = fit)) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = .2) +
geom_line() +
facet_wrap(~ylab, scales = "free")
## 2d smooths
gg_tensor(mod) + xlab("time")
## slices through 2d smooths
# age effect
gg_slice(ped, mod, "age", age = seq_range(age, 20), tend = seq_range(tend, 3))
gg_slice(ped, mod, "age", tend = unique(tend), age = seq_range(age, 3))
# apache effect
gg_slice(ped, mod, "apache_iv", apache_iv = seq_range(apache_iv, 20),
tend = seq_range(tend, 3))
gg_slice(ped, mod, "apache_iv", tend = unique(tend),
apache_iv = seq_range(apache_iv, 3))
# charlson_score effect
gg_slice(ped, mod, "charlson_score",
charlson_score = seq_range(charlson_score, 20), tend = seq_range(tend, 3))
gg_slice(ped, mod, "charlson_score", tend = unique(tend),
charlson_score = seq_range(charlson_score, 3))
Created on 2019-03-29 by the reprex package (v0.2.1)
Wow, great!!!
Thank you very much for your reply and patient guidance. I need to study your code and these plots carefully. I'll report back with my progress if it goes well.
Thanks again.
Hi, I'm back. Another problem has come up...
As you pointed out, modelling time-dependent covariates as time-varying effects might not be appropriate (especially the way I have done it). But I need a Cox model to give the appropriate results (including hazard ratios, 95% CIs and p-values).
So, if I use a Cox model, what should I do with this data?
cicu_cox <- coxph(Surv(tstart, tstop, status) ~ four_bin + age + gender + apache_iv +
                    charlson_score + furosemide + vaso_inotro +
                    vaso_use_firstday + inotrope_use_firstday +
                    intubated_firstday + history_chf +
                    admission_mi + admission_chf + admission_sepsis, data = sicu_dat)
Is the code above OK if I just ignore the violation of the PH assumption for variables such as four_bin, vaso_inotro or age?
With the help of your code, I ran the model on my complete data and got the results below:
mod <- bam(
formula = ped_status ~
ti(tend) +
gender + vaso_use_firstday + intubated_firstday +
history_mi+history_chf+history_renal_failure+admission_mi+
admission_chf+admission_sepsis+
four_bin + ti(tend, by = as.ordered(four_bin)) +
furosemide + ti(tend, by = as.ordered(furosemide)) +
vaso_inotro + ti(tend, by = as.ordered(vaso_inotro)) +
te(age, tend) +
te(tend, apache_iv) +
te(tend, charlson_score),
data = ped,
family = poisson(),
offset = offset,
method = "fREML",
control = list(trace = TRUE))
The variables whose names contain 'history/admission/firstday' are categorical and time-constant.
summary(mod)
Family: poisson
Link function: log
Formula:
ped_status ~ ti(tend) + gender + vaso_use_firstday + intubated_firstday +
history_mi + history_chf + history_renal_failure + admission_mi +
admission_chf + admission_sepsis + four_bin + ti(tend, by = as.ordered(four_bin)) +
furosemide + ti(tend, by = as.ordered(furosemide)) + vaso_inotro +
ti(tend, by = as.ordered(vaso_inotro)) + te(age, tend) +
te(tend, apache_iv) + te(tend, charlson_score)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.95760 0.08233 -108.798 < 2e-16 ***
genderm 0.12847 0.04645 2.766 0.005677 **
vaso_use_firstday1 0.47428 0.07433 6.380 1.77e-10 ***
intubated_firstday1 -0.06134 0.05910 -1.038 0.299322
history_mi1 0.27369 0.07482 3.658 0.000254 ***
history_chf1 -0.03518 0.07013 -0.502 0.615939
history_renal_failure1 0.12277 0.07348 1.671 0.094757 .
admission_mi1 0.73521 0.09929 7.405 1.32e-13 ***
admission_chf1 -0.61358 0.15826 -3.877 0.000106 ***
admission_sepsis1 -0.57086 0.06784 -8.415 < 2e-16 ***
four_bin2 0.12769 0.08516 1.499 0.133765
four_bin3 0.68679 0.10595 6.482 9.05e-11 ***
four_bin4 1.04718 0.17645 5.935 2.95e-09 ***
furosemide1 1.58119 0.14865 10.637 < 2e-16 ***
vaso_inotro1 0.41245 0.14160 2.913 0.003583 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
ti(tend) 2.246 2.435 35.023 6.37e-06 ***
ti(tend):as.ordered(four_bin)2 1.005 1.010 1.593 0.20928
ti(tend):as.ordered(four_bin)3 3.620 3.894 30.813 1.39e-05 ***
ti(tend):as.ordered(four_bin)4 2.730 3.178 12.868 0.00629 **
ti(tend):as.ordered(furosemide)1 2.185 2.634 3.757 0.28949
ti(tend):as.ordered(vaso_inotro)1 3.891 3.990 180.727 < 2e-16 ***
te(age,tend) 5.367 20.000 46.070 4.95e-11 ***
te(tend,apache_iv) 10.750 20.000 133.254 < 2e-16 ***
te(tend,charlson_score) 1.588 2.000 0.969 0.61694
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0177 Deviance explained = 18.2%
fREML = 3.192e+06 Scale est. = 1 n = 1797059
Does the edf value indicate whether a variable's effect is linear? For example, if the edf value is about 1, does the variable have a linear effect, and a non-linear effect if the edf is greater than 1?
## 1d smooths
smooths1d <- tidy_smooth(mod)
ggplot(smooths1d, aes(x = x, y = fit)) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = .2) +
geom_line() +
facet_wrap(~ylab, scales = "free")
## 2d smooths
gg_tensor(mod) + xlab("time")
## slices through 2d smooths
# age effect
gg_slice(ped, mod, "age", age = seq_range(age, 20), tend = seq_range(tend, 3))
gg_slice(ped, mod, "age", tend = unique(tend), age = seq_range(age, 3))
# apache effect
gg_slice(ped, mod, "apache_iv", apache_iv = seq_range(apache_iv, 20),
tend = seq_range(tend, 3))
gg_slice(ped, mod, "apache_iv", tend = unique(tend),
apache_iv = seq_range(apache_iv, 3))
# charlson_score effect
gg_slice(ped, mod, "charlson_score",
charlson_score = seq_range(charlson_score, 20), tend = seq_range(tend, 3))
gg_slice(ped, mod, "charlson_score", tend = unique(tend),
charlson_score = seq_range(charlson_score, 3))
Based on the bam model, I think the variables four_bin (levels 3 and 4), vaso_inotro, age and apache_iv
have non-linear effects. Is that right?
Next, how should I proceed with the Cox model given this information? I need your guidance!!!
Thanks very much! Looking forward to your reply.
1) As I said, I would not model time-dependent covariates as time-varying effects as you have done above, unless you assume that the same value of the covariate will have a different effect on the hazard depending on when it occurs (and there might be some interpretation issues: does the effect of a given covariate value change over time, or do those covariate values simply only occur at later time points, etc.).
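For concreteness, a trimmed-down sketch of that simpler specification (my reading of the suggestion, keeping only a subset of the covariates for brevity): the time-dependent covariates enter as plain terms and their ti(tend, by = ...) deviations are dropped.
mod_simple <- bam(
  ped_status ~ ti(tend) +
    gender + vaso_use_firstday + intubated_firstday +
    four_bin + furosemide + vaso_inotro +  # time-dependent covariates, constant effects
    te(age, tend) + te(tend, apache_iv) + te(tend, charlson_score),
  data = ped, family = poisson(), offset = offset, method = "fREML")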
2) seq_range(variable, 3)
was only an illustration; you might want to set specific values of variable
that you are interested in. For example, you should check whether the apache_iv
effect is plausible (is -1
even a possible value?).
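For example (the values below are placeholders for illustration, not recommendations):
gg_slice(ped, mod, "apache_iv",
         tend = c(1, 3, 7),           # time points of interest
         apache_iv = c(40, 70, 100))  # plausible score values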
3) This is a package repository, so issues/questions should concern package functionality, for example your data-transformation question. Questions on model interpretation, however, are better asked on Cross Validated. Also, our tutorial paper is now freely available from the homepage. Maybe that will help you with interpretation.
4) We developed the methods (PAMs) and the package exactly because, in our opinion, they offer more flexibility and better estimation procedures compared to the existing Cox implementations (e.g. in the survival package). You will probably be able to run a similar model in survival::coxph
or rms::cph
, but you will have to ask the respective package authors about it. Personally, I think there is no good reason to fit a Cox model instead of a PAM in this situation, but that's for you to decide.
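If hazard ratios with confidence intervals and p-values are the main requirement, here is a sketch of how they could be read off the PAM's parametric coefficients (assuming mod is the bam fit above; for covariates that also have a ti(tend, by = ...) term this is only the time-constant part of the effect):
pt <- summary(mod)$p.table  # parametric coefficient table (intercept row = baseline log-hazard, not an HR)
data.frame(HR    = exp(pt[, "Estimate"]),
           lower = exp(pt[, "Estimate"] - 1.96 * pt[, "Std. Error"]),
           upper = exp(pt[, "Estimate"] + 1.96 * pt[, "Std. Error"]),
           p     = pt[, "Pr(>|z|)"])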
Hi, now I have a dataset with time-varying variables (multiple rows per subject). The data format is tstart | tstop | status, and the time intervals are regular (2-hour spans).
I have done a survival analysis using a Cox model with this data, and some variables violate the Cox PH assumption.
I want to use a GAM to solve the problem, but I don't know how to transform my data into the right format for gam (my data is not like pbc or pbcseq, so as_ped
may not work).