Closed darentsai closed 7 months ago
Thank you! Your change is correct. I very much appreciate your PR and explanation. I'll merge this in now.
@bcjaeger Thanks for your quick reply and accepting the PR.
I'm reading your excellent work on AORSF published in JCGS (2024), and some of its contents confuse me a little. Could I ask you about it here?
w <- sample(from = {0, ..., 10}, size = nrow(x_train), replace = T)
(The weights w are then used in Newton-Raphson scoring to solve for the betas.) It's the first time I've seen this bootstrap scheme. Is there any literature describing this approach? (Or any keyword I can google?)
Thanks for any reply!
@darentsai no problem! Thank you for reviewing the code and for your kind words about the paper.
You've asked two very good questions.
I have also read your article published in AOAS (2019) about ORSF. In that work, you used a regularized Cox model to compute the decision boundary, but in the recent paper you used a standard Cox model. Besides improving computational speed, are there any other reasons for this change?
There is no benefit apart from the improved computational speed. One of the most interesting findings from the JCGS paper is that the much faster approach of running a single iteration of the Newton-Raphson procedure ends up being practically equivalent, in terms of C-statistic, to the much more thorough penalized Cox regression approach from the AOAS paper.
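For intuition, the practical equivalence can be sketched with survival::coxph by capping the Newton-Raphson procedure at one iteration (a minimal sketch on simulated data, not the aorsf implementation; coxph will warn that it ran out of iterations):

```r
library(survival)

set.seed(1)
n <- 500
x <- rnorm(n)
time <- rexp(n, rate = exp(0.5 * x))  # hazard increases with x
status <- rbinom(n, 1, 0.7)           # random censoring indicator

# fully iterated Newton-Raphson fit
fit_full <- coxph(Surv(time, status) ~ x)

# same model, but stopping after a single Newton-Raphson iteration
fit_one <- coxph(Surv(time, status) ~ x,
                 control = coxph.control(iter.max = 1))

# the one-step coefficient lands close to the fully iterated one
c(full = unname(coef(fit_full)), one_step = unname(coef(fit_one)))
```

The ranking of observations by linear predictor, which is all a split rule needs, is typically near-identical between the two fits.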
You said that memory is conserved by conducting bootstrap resampling via "random integer-valued weights". Is there any literature describing this approach? (Or any keyword I can google?)
I would guess that other packages (e.g., ranger) do this to make bootstrapping more efficient. I don't know of a good prior reference for this approach (you can always cite the JCGS paper, though), but I don't think you need a reference here, as you can convince yourself with code that it is asymptotically identical to the usual bootstrap procedure.
# bootstrap a standard error
n_obs <- 1000
n_boots <- 10000
x <- rnorm(n_obs)

boot_stats_regular <- vector('numeric', n_boots)
boot_stats_weights <- vector('numeric', n_boots)

for(i in seq(n_boots)){
  # regular bootstrap approach
  index <- sample(n_obs, replace = TRUE)
  # weighted bootstrap (using same sample as regular)
  weights <- vector("numeric", n_obs)
  for(w in seq_along(weights)) weights[w] <- sum(index == w)
  # do this to match what aorsf does (i.e., no getting sampled > 10 times)
  weights <- pmin(weights, 10)
  boot_stats_regular[i] <- mean(x[index])
  boot_stats_weights[i] <- weighted.mean(x, w = weights)
}
# check
sd(boot_stats_regular)
#> [1] 0.03138031
sd(boot_stats_weights)
#> [1] 0.03138031
t.test(x)$stderr
#> [1] 0.03088334
Created on 2024-05-12 with reprex v2.1.0
@bcjaeger I very much appreciate your kindness and detailed explanation!
After testing your example code, I clearly understand that using "random integer-valued weights" indeed mimics the bootstrap procedure. Initially, I had doubts about its validity because of this line of pseudocode from the JCGS paper:
$$w \leftarrow sample(from = \{0, ..., 10\}, size = nrow(x_{train}), replace = T)$$
I intuitively read it as drawing the weights from a discrete uniform distribution on 0 to 10. However, if we use a statistical distribution to describe the number of times an observation is selected in a bootstrap sample, it should be binomial or Poisson.
I adapted your example code to include weights generated by the binomial, Poisson, and uniform distributions. It shows that uniformly distributed weights underestimate the standard error. Does this mean the pseudocode is misleading?
n_obs <- 1000
n_boots <- 10000
x <- rnorm(n_obs)

boot_stats_weights <- vector('numeric', n_boots)
boot_stats_weights_binom <- vector('numeric', n_boots)
boot_stats_weights_pois <- vector('numeric', n_boots)
boot_stats_weights_unif <- vector('numeric', n_boots)

for(i in seq(n_boots)){
  # weighted bootstrap
  index <- sample(n_obs, replace = TRUE)
  weights <- vector("numeric", n_obs)
  for(w in seq_along(weights)) weights[w] <- sum(index == w)
  # weighted bootstrap (binomial distribution)
  w_binom <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)
  # weighted bootstrap (Poisson distribution)
  w_pois <- rpois(n_obs, lambda = 1)
  # weighted bootstrap (uniform distribution)
  w_unif <- sample(0:10, size = n_obs, replace = TRUE)
  boot_stats_weights[i] <- weighted.mean(x, w = weights)
  boot_stats_weights_binom[i] <- weighted.mean(x, w = w_binom)
  boot_stats_weights_pois[i] <- weighted.mean(x, w = w_pois)
  boot_stats_weights_unif[i] <- weighted.mean(x, w = w_unif)
}
# check
t.test(x)$stderr # 0.03226258
sd(boot_stats_weights) # 0.03236514
sd(boot_stats_weights_binom) # 0.03218649
sd(boot_stats_weights_pois) # 0.03287533
sd(boot_stats_weights_unif) # 0.02058976
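A back-of-envelope calculation predicts the size of that underestimate: for i.i.d. weights, the bootstrap SE of a weighted mean scales with sd(w)/mean(w), which is about 1 for Binomial(n, 1/n) weights but sqrt(10)/5 ≈ 0.63 for Uniform{0, ..., 10} weights (a minimal sketch, not from either paper):

```r
# variance-to-squared-mean ratio for Uniform{0, ..., 10} weights
w_vals <- 0:10
unif_mean <- mean(w_vals)                  # 5
unif_var  <- mean((w_vals - unif_mean)^2)  # 10
ratio_unif <- unif_var / unif_mean^2       # 0.4

# same ratio for Binomial(n, 1/n) weights: variance (1 - 1/n), mean 1
n_obs <- 1000
ratio_binom <- (1 - 1 / n_obs)

# predicted shrinkage of the uniform-weight SE relative to the bootstrap SE
sqrt(ratio_unif / ratio_binom)  # ~0.633, close to 0.0206 / 0.0323 above
```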
Ahh I see. This is a very helpful illustration!
I can see how the pseudocode implies uniform sampling. That was not my intention and I apologize for the confusion 😞
The sampling procedure used to mimic bootstrapping is carried out mostly by the code below
uword i, draw, n = data->n_rows;

std::uniform_int_distribution<uword> udist_rows(0, n - 1);

if(sample_with_replacement){
  for (i = 0; i < n; ++i) {
    draw = udist_rows(random_number_generator);
    ++w_inbag[draw];
  }
}
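For readers who don't follow C++, here is a line-for-line R sketch of that loop (n is made up here): each of the n draws increments one weight, so each w_inbag[i] ends up Binomial(n, 1/n) rather than Uniform{0, ..., 10}.

```r
set.seed(1)
n <- 1000  # stands in for data->n_rows

# w_inbag starts at zero, like the C++ vector
w_inbag <- integer(n)

for (i in seq_len(n)) {
  # draw = udist_rows(random_number_generator)
  draw <- sample.int(n, size = 1)
  # ++w_inbag[draw]
  w_inbag[draw] <- w_inbag[draw] + 1
}

sum(w_inbag)    # always n: exactly one increment per draw
table(w_inbag)  # frequencies track dbinom(0:max(w_inbag), n, 1/n) * n
```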
Do you think we could describe this with pseudocode? I am thinking maybe something like this:
probs <- (1 / n_obs) ^ (seq(0, 10))
w <- sample(seq(0, 10), size = n_obs, replace = TRUE, probs = probs)
Maybe a loop is an easy-to-read way to express the binomial sampling, but it could be somewhat lengthy for pseudocode.
w <- rep(0, times = nrow(x))
for(i in seq_along(w)) {
pos <- sample(from = seq_along(w), size = 1)
w[pos] <- w[pos] + 1
}
Your idea is nice! But the probabilities probs may not align with the distribution behind the bootstrap. I ran it in R and got a vector of weights with all values being 0. Perhaps using probabilities from the binomial distribution would be reasonable.
probs <- choose(n_obs, seq(0, 10)) * (1 / n_obs) ^ seq(0, 10) * (1 - 1 / n_obs) ^ (n_obs - seq(0, 10))
# equivalent to
probs <- dbinom(0:10, n_obs, 1 / n_obs)
w <- sample(seq(0, 10), size = n_obs, replace = TRUE, prob = probs)
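A quick sanity check on those two lines (a sketch; n_obs is chosen arbitrarily): the probabilities for 0 through 10 cover essentially all of the binomial mass, and the sampled weights have the bootstrap's expected value of 1.

```r
n_obs <- 1000
probs <- dbinom(0:10, size = n_obs, prob = 1 / n_obs)

# truncating the weights at 10 discards almost no probability mass
sum(probs)  # 1 - sum(probs) is smaller than 1e-7

set.seed(42)
w <- sample(0:10, size = n_obs, replace = TRUE, prob = probs)
mean(w)     # close to 1, the expected bootstrap weight
```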
I'm not sure if putting this arithmetic into an algorithm pseudocode is proper or not. Maybe we can just write:
$$ probs \leftarrow prob\_binomial(x = seq(0, 10), n\_trial = n\_obs, prob = 1 / n\_obs)$$
$$ w \leftarrow sample(from = seq(0, 10), size = n\_obs, replace = TRUE, probs = probs) $$
Sorry I don't have better ideas now...
Oh right, probs in my idea was off because it didn't include the probability of not being sampled. dbinom works very well. How about this?
$probs \leftarrow dbinom(seq(0, 10), size = n_{obs}, prob = 1 / n_{obs})$
$w \leftarrow sample(seq(0, 10), size = n_{obs}, replace = TRUE, probs = probs) $
It looks great! My only concern is that $dbinom(...)$ is R-specific syntax, and I'm not sure non-R users will understand what it means, which is why I used the word $prob\_binomial(...)$ in my previous comment to make it more informative. If you think it's not an issue, you can go ahead and use $dbinom$.
Another option is
$$w \leftarrow sample\_binomial(n_{obs}, size = n_{obs}, probs = 1/n_{obs})$$
corresponding to the R command:
w <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)
This way saves the line defining $probs \leftarrow (...)$ and avoids the upper bound of 10.
Best regards,
I love this!
I would like to re-run this analysis with the xgboost predictions working as intended and then update the pseudocode. It will take a little while since the analysis requires a lot of computing time, but when it's done, may I include an acknowledgment to you in the paper for your help?
Of course, it's my pleasure! I'm still a first-year PhD student with limited insight, so communicating with you about this project has been a valuable experience for me. I use the name Dai-Rong Tsai in academic articles, so you could mention me in the paper with it. I've added my email to my profile intro; feel free to contact me whenever needed.
Dear Maintainer,

I'm trying to train an xgboost model with the Cox proportional hazards loss using your code. You estimated the baseline cumulative hazard $H_0(t)$ by gbm::basehaz.gbm.

https://github.com/bcjaeger/aorsf-bench/blob/c8a9b5b6fd1caa078c50bad52664c5cb993b1268/R/model_fit.R#L397-L406

I searched its documentation and see that the parameter f.x needs to be the predicted values of the regression model on the log hazard scale. However, you passed the output of predict(fit, newdata = xmat) into it, and the predict method for class 'xgb.Booster' defaults to returning predictions on the hazard ratio scale (i.e., as $HR = exp(X\beta)$ in the proportional hazards function $h(t) = h_0(t) * HR$), not the log hazard scale $log(HR) = X\beta$. I think you may need to set outputmargin = TRUE before passing the predictions into basehaz.gbm; outputmargin controls whether the predictions are returned as the original, untransformed sum of predictions from the boosting iterations.

The same issue appears in the xgb_cox_pred function from model_pred.R.

https://github.com/bcjaeger/aorsf-bench/blob/c8a9b5b6fd1caa078c50bad52664c5cb993b1268/R/model_pred.R#L209-L213

It seems that you use the formula $S(t) = exp(exp(X\beta) \cdot (-H_0(t)))$ to evaluate survival probabilities. The lin_preds you define is already $exp(X\beta)$, and you take one more exp() of it, so I think you also need to set outputmargin = TRUE here.

Maybe all of the above is my misunderstanding! Thank you for any replies. This project is outstanding work!
Best regards,