bblodfon opened this issue 1 year ago
Hello @bblodfon,
Thank you! I am thrilled to hear aorsf is doing well in your analysis. 😄
Then you take the mean of those values. And you have calculated risk scores per time point for each model here, right?
This is correct. I have waffled back and forth on whether it makes sense to take a mean C-statistic over a range of times. My rationale for doing it in this analysis was that I wanted to make relative comparisons between different models rather than interpret how effectively a model would discriminate between high and low risk for a specific task, so taking a mean made sense. If I were trying to estimate how effective a model would be for a specific prediction task, e.g., 10-year risk of CVD, I would just get the time-dependent value of C at 10 years.
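To make the distinction concrete, here is a minimal sketch using pbc_orsf and the same Score() call as in the reprex further down (the specific times are illustrative, not the grid from the analysis):

library(riskRegression)
library(survival)
library(aorsf)

sc <- Score(object = list(pbc_orsf$bili),
            formula = Surv(time, status) ~ 1,
            data = pbc_orsf,
            times = c(1000, 2000, 3000),
            metrics = 'auc')

# relative model comparison: average the time-dependent AUCs over the grid
mean(sc$AUC$score$AUC)

# specific prediction task: read off the AUC at a single horizon, e.g., t = 3000
sc$AUC$score$AUC[sc$AUC$score$times == 3000]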
First, do you know exactly which paper the AUC calculation comes from? In the paper you cite the timeROC paper, Blanche (2013), but is that different?
Equation 3 in Blanche et al., 2013. This method has traversed through different software packages over the years. I include a reprex below so that you can see timeROC and riskRegression compute the same thing given certain inputs.
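For reference, here is a sketch of that estimator in my own notation (cumulative cases, dynamic controls, IPCW; not copied verbatim from the paper):

$$
\widehat{\mathrm{AUC}}(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \mathbb{1}\{M_i > M_j\}\, \mathbb{1}\{\tilde{T}_i \le t,\ \delta_i = 1\}\, \mathbb{1}\{\tilde{T}_j > t\}\, \hat{W}_i \hat{W}_j(t)}{\sum_{i=1}^{n}\sum_{j=1}^{n} \mathbb{1}\{\tilde{T}_i \le t,\ \delta_i = 1\}\, \mathbb{1}\{\tilde{T}_j > t\}\, \hat{W}_i \hat{W}_j(t)}
$$

where $M$ is the marker, $\tilde{T}$ the observed time, $\delta$ the event indicator, $\hat{W}_i = 1/\hat{G}(\tilde{T}_i)$ for cases and $\hat{W}_j(t) = 1/\hat{G}(t)$ for controls, with $\hat{G}$ the Kaplan-Meier estimator of the censoring distribution (this is what weighting = "marginal" selects).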
Secondly, is the above calculation really a general time-dependent C-index? I have read Antolini's (2005) work, which uses the survival distribution prediction to derive a C-index (so it's quite general), and there are other AUC-ROC-based C-index metrics that use linear predictors (so they require Cox models and/or some random censoring assumption to be in place). I really don't know, I just wanted your opinion on this, since the details were missing from the paper.
I would say the above calculation is just a mean of time-dependent C-statistics, and in that sense, you could say we are 'integrating out' the time dimension. One reason for taking the mean here was that otherwise, I would have to pick a specific prediction horizon for each prediction task, and I did not see a reasonably objective way to do that. Additionally, for the sake of comparing models, I thought taking the mean C-stat made the most sense (it's pretty similar to how we compare Brier scores via integration over time, right?).
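On the Brier analogy: here is a minimal sketch putting the two side by side. Assumptions on my part: a Cox model as the prediction object, and summary = 'ibs' being available in your riskRegression version.

library(riskRegression)
library(survival)
library(aorsf)

# a Cox model as an example prediction object (x = TRUE is needed by Score)
fit <- coxph(Surv(time, status) ~ bili, data = pbc_orsf, x = TRUE)

sc <- Score(object = list(cox = fit),
            formula = Surv(time, status) ~ 1,
            data = pbc_orsf,
            times = c(1000, 2000, 3000),
            metrics = c('auc', 'brier'),
            summary = 'ibs') # assumption: 'ibs' supported in your version

mean(sc$AUC$score$AUC) # mean time-dependent AUC (the 'integrated' C here)
sc$Brier$score         # Brier at each time, with the integrated Brier score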
Overall, my opinion on the matter is that time-dependent C-statistics are confusing because (1) there are a lot of ways to compute them, (2) the same method is implemented in multiple packages, and (3) it's not clear why there is an integrated Brier score but no integrated C-statistic.
I would love to see a stable R package that clears this up. Making them more accessible might just convince the majority of scientific authors to stop using Harrell's C-statistic in analyses with lots of censoring!
library(timeROC)
library(riskRegression)
#> riskRegression version 2023.03.22
library(aorsf)
library(survival)
# IPCW time-dependent AUC with marginal (Kaplan-Meier) censoring weights,
# using serum bilirubin as the marker
cstat_timeroc <- timeROC(T = pbc_orsf$time,
                         delta = pbc_orsf$status,
                         marker = pbc_orsf$bili,
                         cause = 1,
                         weighting = "marginal",
                         times = c(1000, 2000, 3000),
                         iid = TRUE)

# the same computation via riskRegression
cstat_riskregression <- Score(object = list(pbc_orsf$bili),
                              formula = Surv(time, status) ~ 1,
                              data = pbc_orsf,
                              times = c(1000, 2000, 3000),
                              metrics = 'auc')
cstat_timeroc$AUC - cstat_riskregression$AUC$score$AUC
#> t=1000 t=2000 t=3000
#> 0 0 0
Created on 2023-10-11 with reprex v2.0.2
Thanks so much for your thoughts on this!
I also exchanged a few emails with Thomas Gerds and Paul Blanche to clear this up. It's as you said. In particular, the riskRegression version is a bit faster and has some minor bug corrections compared to the timeROC one. It's also pretty close to Uno's AUC from survAUC:
library(timeROC)
library(riskRegression)
#> riskRegression version 2023.03.22
library(aorsf)
library(survival)
library(survAUC)
# timeROC: IPCW time-dependent AUC with marginal (Kaplan-Meier) weights
cstat_timeroc <- timeROC(T = pbc_orsf$time,
                         delta = pbc_orsf$status,
                         marker = pbc_orsf$bili,
                         cause = 1,
                         weighting = "marginal",
                         times = c(1000, 2000, 3000),
                         iid = TRUE)

# riskRegression: the same computation, treating bili as the marker
cstat_riskregression <- Score(object = list(pbc_orsf$bili),
                              formula = Surv(time, status) ~ 1,
                              data = pbc_orsf,
                              times = c(1000, 2000, 3000),
                              metrics = 'auc')

# survAUC: Uno's estimator, using the same data for train and test
cstat_survAUC <- survAUC::AUC.uno(
  Surv.rsp = Surv(pbc_orsf$time, pbc_orsf$status),
  Surv.rsp.new = Surv(pbc_orsf$time, pbc_orsf$status), # train = test data
  times = c(1000, 2000, 3000),
  lpnew = pbc_orsf$bili
)
cstat_timeroc$AUC - cstat_riskregression$AUC$score$AUC
#> t=1000 t=2000 t=3000
#> 0 0 0
# SAME RESULTS! USE riskRegression!
cstat_survAUC$auc - cstat_timeroc$AUC
#> t=1000 t=2000 t=3000
#> 2.220446e-16 -7.090504e-04 -3.869590e-03
Created on 2023-10-11 with reprex v2.0.2
My only concern is that, since the above are the same (some differences in implementation for sure, but I would call them the same) and their formulas reduce to Uno's AUC for right-censored data using the Kaplan-Meier estimator for the censoring distribution, wouldn't Uno's AUC assumptions then have to hold? E.g., a 1-1 correspondence between the predictor/marker and the expected survival times, and random censoring?
FYI @RaphaelS1
I'm not sure if the equality of the C-statistics implies the assumptions are true, but I do think it makes sense that the riskRegression and timeROC C-stats reduce to Uno's when we don't leverage their additional features. That is reassuring.
I am still trying to work out whether the 1-1 assumption should hold or not (maybe it doesn't affect the results that much in the end). My rationale comes from the following example:
Suppose we have two or more patients/observations whose survival predictions/curves (with risk(t) = 1 - surv(t), as is documented in riskRegression::Score) cross, as may happen with a survival forest model, and I request the model's AUC at t1 before the crossing and at t2 afterwards. Does it even make sense to evaluate the survival prediction using AUC(t) across time in this case, or even to take the mean?
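Here is a toy sketch of what I mean, with two hypothetical risk curves (not from a fitted model) that cross:

# patient A: risk rises slowly but keeps rising
risk_A <- function(t) 1 - exp(-0.001 * t)
# patient B: risk rises quickly, then plateaus
risk_B <- function(t) pmin(1 - exp(-0.002 * t), 0.78)

t1 <- 1000; t2 <- 3000
risk_A(t1) < risk_B(t1) # TRUE: B is ranked higher risk at t1
risk_A(t2) > risk_B(t2) # TRUE: A is ranked higher risk at t2

So AUC(t1) and AUC(t2) would reward opposite orderings of these two patients.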
Note that for a particular model, or for comparing models (as you suggested) at a single time point, I definitely see no problem with that (and that's a time-dependent AUC, not an integrated one; maybe 'integrated AUC' would have been a better name for what you calculated in the end). In the end the metric could itself be time-dependent, e.g., one model might perform better for an earlier horizon t1 while another performs better for a later horizon t2.
Paul's answer to the above was:
With AUC(t), we evaluate predicted risks at a given time t (e.g. 1 year risk of death). The predicted risks at other times s in ]0,t[ are irrelevant.
I agree also with what you said:
I would have to pick a specific prediction horizon for each prediction task, and I did not see a reasonably objective way to do that.
Yes, that would have been the better solution, but I see the difficulty in trying to find the optimal per-task horizon (and there might be multiple, depending on the application).
But may I ask what vector of time points/horizons was chosen for each particular task (I guess something akin to unique event points, different per test set)? I am asking because a weighted AUC mean (weighted by the difference between time points) would be better than a simple mean in the case that the time points weren't equidistant.
It depends; here is the relevant code:
# bound the horizons by the 25th and 75th percentiles of event times
# in the combined data
event_time_bounds <- quantile(
  x = data_all$time[data_all$status == 1],
  probs = c(.25, .75)
)

# candidate horizons: unique event times in the test set, within the bounds
pred_horizon <- sort(unique(.test$time[.test$status == 1]))
pred_horizon <- pred_horizon[pred_horizon <= event_time_bounds['75%']]
pred_horizon <- pred_horizon[pred_horizon >= event_time_bounds['25%']]

if(is_empty(pred_horizon)) return(NULL)

# keep at most 30 roughly evenly spaced horizons
if(length(pred_horizon) > 30){
  pred_horizon <-
    pred_horizon[floor(seq(1, length(pred_horizon), length.out = 30))]
}
A time-weighted C would certainly be more like the IBS. TL;DR: I did not use the time-weighted C.
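If you wanted one, though, it's a one-liner with trapezoidal weights over the time grid; here is a sketch with hypothetical AUC values on a non-equidistant grid:

times <- c(500, 1000, 3000)
aucs  <- c(0.80, 0.75, 0.70)

mean(aucs) # simple mean: 0.75

# trapezoidal rule, normalized by the span of the grid
sum(diff(times) * (head(aucs, -1) + tail(aucs, -1)) / 2) / diff(range(times))
# 0.735: the long gap between t = 1000 and t = 3000 gets more weight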
OK! So you take the unique event times in the test set, keep those within the interquartile range (25th-75th percentiles) of event times in the combined data, and thin the vector by "spreading out" the time values so that there are no more than 30 of them.
Why was the value 30 chosen?
Arbitrary choice. Some prediction tasks had hundreds of unique event times, and that made riskRegression run very slowly. Using at most 30 times was convenient, but it might have been okay to use up to 50 or even 100. It depends on how many models you want to evaluate at all those time points.
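To illustrate, the thinning step is easy to check on a toy grid (hypothetical times, not from the analysis):

pred_horizon <- seq(10, 1000, by = 10) # 100 unique times
keep <- floor(seq(1, length(pred_horizon), length.out = 30))
pred_horizon[keep] # 30 roughly evenly spaced times from the original grid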
Minimizing the number of time points to calculate the AUC at is a very smart tactic, I would say! I think we can implement something similar in mlr3proba for efficiency.
Feel free to keep the post as a question/discussion. Thanks for spending the time to answer all of this, Byron!
Hi @bcjaeger,
Congrats on publishing the aorsf paper! It has also come up as the best learner in my own work. I wanted to ask your opinion on the following regarding this benchmarking work:
I was looking at the code that calculates the time-dependent C-index, and I saw you are using the riskRegression::Score() function and getting the AUC for each of the times = pred_horizon (vector of time points). Then you take the mean of those values. And you have calculated risk scores per time point for each model here, right?
First, do you know exactly which paper the AUC calculation comes from? In the paper you cite the timeROC paper, Blanche (2013), but is that different?
Secondly, is the above calculation really a general time-dependent C-index? I have read Antolini's (2005) work, which uses the survival distribution prediction to derive a C-index (so it's quite general), and there are other AUC-ROC-based C-index metrics that use linear predictors (so they require Cox models and/or some random censoring assumption to be in place). I really don't know, I just wanted your opinion on this, since the details were missing from the paper.
In mlr3proba we will soon include Antolini's C-index, and after I saw this, I was considering also adding this one, as it is already implemented in riskRegression, but we'll see!