Closed darentsai closed 7 months ago
Thank you! Your change is correct. I very much appreciate your PR and explanation. I'll merge this in now.
@bcjaeger Thanks for your quick reply and accepting the PR.
I'm reading your excellent work on AORSF published in JCGS (2024), and some of its contents confuse me a little. Could I ask you about it here?
w <- sample(from = {0, ..., 10}, size = nrow(x_train), replace = T)
(The weights w are then used in Newton-Raphson scoring to solve for the betas.) It's the first time I've seen this bootstrap scheme. Is there any literature describing this approach? (Or any keyword I can google?)
Thanks for any reply!
@darentsai no problem! Thank you for reviewing the code and for your kind words about the paper.
You've asked two very good questions.
I have also read your article published in AOAS (2019) about ORSF. In that work, you used a regularized Cox model to compute the decision boundary, but in the recent paper you used a standard Cox model. Besides improving computational speed, are there any other reasons for this change?
There is no benefit apart from the improved computational speed. One of the most interesting findings from the JCGS paper is that the much faster approach of running a single iteration of the Newton-Raphson procedure ends up being practically equivalent, in terms of C-statistic, to the much more thorough penalized Cox regression approach from the AOAS paper.
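For intuition, the practical equivalence can be sketched with survival::coxph by capping the Newton-Raphson procedure at one iteration (a minimal sketch on simulated data, not the aorsf implementation; coxph will warn that it ran out of iterations):

```r
library(survival)

set.seed(1)
n <- 500
x <- rnorm(n)
time <- rexp(n, rate = exp(0.5 * x))  # hazard increases with x
status <- rbinom(n, 1, 0.7)           # random censoring indicator

# fully iterated Newton-Raphson fit
fit_full <- coxph(Surv(time, status) ~ x)

# same model, but stopping after a single Newton-Raphson iteration
fit_one <- coxph(Surv(time, status) ~ x,
                 control = coxph.control(iter.max = 1))

# the one-step coefficient lands close to the fully iterated one
c(full = unname(coef(fit_full)), one_step = unname(coef(fit_one)))
```

The ranking of observations by linear predictor, which is all a split rule needs, is typically near-identical between the two fits.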
You said that memory is conserved by conducting bootstrap resampling via "random integer-valued weights". Is there any literature describing this approach? (Or any keyword I can google?)
I would guess that other packages (e.g., ranger) do this to make bootstrapping more efficient. I don't know of a good prior reference for this approach (you can always cite the JCGS paper, though), but I don't think you need a reference here, as you can convince yourself with code that it is asymptotically identical to the usual bootstrap procedure.
# bootstrap a standard error
n_obs <- 1000
n_boots <- 10000
x <- rnorm(n_obs)

boot_stats_regular <- vector('numeric', n_boots)
boot_stats_weights <- vector('numeric', n_boots)

for(i in seq(n_boots)){
  # regular bootstrap approach
  index <- sample(n_obs, replace = TRUE)
  # weighted bootstrap (using same sample as regular)
  weights <- vector("numeric", n_obs)
  for(w in seq_along(weights)) weights[w] <- sum(index == w)
  # do this to match what aorsf does (i.e., no getting sampled > 10 times)
  weights <- pmin(weights, 10)
  boot_stats_regular[i] <- mean(x[index])
  boot_stats_weights[i] <- weighted.mean(x, w = weights)
}
# check
sd(boot_stats_regular)
#> [1] 0.03138031
sd(boot_stats_weights)
#> [1] 0.03138031
t.test(x)$stderr
#> [1] 0.03088334
Created on 2024-05-12 with reprex v2.1.0
@bcjaeger I very much appreciate your kindness and detailed explanation!
After testing your example code, I clearly understand that using "random integer-valued weights" indeed mimics the bootstrap procedure. Initially, I had doubts about its validity because of this line of pseudocode from the JCGS paper:
$$w \leftarrow sample(from = \{0, ..., 10\}, size = nrow(x_{train}), replace = T)$$
I intuitively read it as drawing the weights from a discrete uniform distribution on 0 to 10. However, if we use a statistical distribution to describe the number of times an observation is selected in a bootstrap sample, it should be binomial or Poisson.
I adapted your example code to include weights generated by the binomial, Poisson, and uniform distributions. It shows that uniformly distributed weights underestimate the standard error. Does this mean the pseudocode is misleading?
n_obs <- 1000
n_boots <- 10000
x <- rnorm(n_obs)

boot_stats_weights <- vector('numeric', n_boots)
boot_stats_weights_binom <- vector('numeric', n_boots)
boot_stats_weights_pois <- vector('numeric', n_boots)
boot_stats_weights_unif <- vector('numeric', n_boots)

for(i in seq(n_boots)){
  # weighted bootstrap
  index <- sample(n_obs, replace = TRUE)
  weights <- vector("numeric", n_obs)
  for(w in seq_along(weights)) weights[w] <- sum(index == w)
  # weighted bootstrap (binomial distribution)
  w_binom <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)
  # weighted bootstrap (Poisson distribution)
  w_pois <- rpois(n_obs, lambda = 1)
  # weighted bootstrap (uniform distribution)
  w_unif <- sample(0:10, size = n_obs, replace = TRUE)
  boot_stats_weights[i] <- weighted.mean(x, w = weights)
  boot_stats_weights_binom[i] <- weighted.mean(x, w = w_binom)
  boot_stats_weights_pois[i] <- weighted.mean(x, w = w_pois)
  boot_stats_weights_unif[i] <- weighted.mean(x, w = w_unif)
}
# check
t.test(x)$stderr # 0.03226258
sd(boot_stats_weights) # 0.03236514
sd(boot_stats_weights_binom) # 0.03218649
sd(boot_stats_weights_pois) # 0.03287533
sd(boot_stats_weights_unif) # 0.02058976
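A back-of-envelope calculation predicts the size of that underestimate: for i.i.d. weights, the bootstrap SE of a weighted mean scales with sd(w)/mean(w), which is about 1 for Binomial(n, 1/n) weights but sqrt(10)/5 ≈ 0.63 for Uniform{0, ..., 10} weights (a minimal sketch, not from either paper):

```r
# variance-to-squared-mean ratio for Uniform{0, ..., 10} weights
w_vals <- 0:10
unif_mean <- mean(w_vals)                  # 5
unif_var  <- mean((w_vals - unif_mean)^2)  # 10
ratio_unif <- unif_var / unif_mean^2       # 0.4

# same ratio for Binomial(n, 1/n) weights: variance (1 - 1/n), mean 1
n_obs <- 1000
ratio_binom <- (1 - 1 / n_obs)

# predicted shrinkage of the uniform-weight SE relative to the bootstrap SE
sqrt(ratio_unif / ratio_binom)  # ~0.633, close to 0.0206 / 0.0323 above
```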
Ahh I see. This is a very helpful illustration!
I can see how the pseudocode implies uniform sampling. That was not my intention and I apologize for the confusion 😞
The sampling procedure used to mimic bootstrapping is carried out mostly by the code below
uword i, draw, n = data->n_rows;

std::uniform_int_distribution<uword> udist_rows(0, n - 1);

if(sample_with_replacement){
  for (i = 0; i < n; ++i) {
    draw = udist_rows(random_number_generator);
    ++w_inbag[draw];
  }
}
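For readers who don't follow C++, here is a line-for-line R sketch of that loop (n is made up here): each of the n draws increments one weight, so each w_inbag[i] ends up Binomial(n, 1/n) rather than Uniform{0, ..., 10}.

```r
set.seed(1)
n <- 1000  # stands in for data->n_rows

# w_inbag starts at zero, like the C++ vector
w_inbag <- integer(n)

for (i in seq_len(n)) {
  # draw = udist_rows(random_number_generator)
  draw <- sample.int(n, size = 1)
  # ++w_inbag[draw]
  w_inbag[draw] <- w_inbag[draw] + 1
}

sum(w_inbag)    # always n: exactly one increment per draw
table(w_inbag)  # frequencies track dbinom(0:max(w_inbag), n, 1/n) * n
```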
Do you think we could describe this with pseudocode? I am thinking maybe something like this:
probs <- (1 / n_obs) ^ (seq(0, 10))
w <- sample(seq(0, 10), size = n_obs, replace = TRUE, probs = probs)
Maybe a loop is an easy-to-read way to express the binomial sampling, but it could be somewhat lengthy for pseudocode.
w <- rep(0, times = nrow(x))
for(i in seq_along(w)) {
pos <- sample(from = seq_along(w), size = 1)
w[pos] <- w[pos] + 1
}
Your idea is nice! But the probabilities probs may not align with the distribution behind the bootstrap. I ran it in R and got a vector of weights with all values being 0. Perhaps using probabilities from the binomial distribution would be reasonable.
probs <- choose(n_obs, seq(0, 10)) * (1 / n_obs) ^ seq(0, 10) * (1 - 1 / n_obs) ^ (n_obs - seq(0, 10))
# equivalent to
probs <- dbinom(0:10, n_obs, 1 / n_obs)
w <- sample(seq(0, 10), size = n_obs, replace = TRUE, prob = probs)
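A quick sanity check on those two lines (a sketch; n_obs is chosen arbitrarily): the probabilities for 0 through 10 cover essentially all of the binomial mass, and the sampled weights have the bootstrap's expected value of 1.

```r
n_obs <- 1000
probs <- dbinom(0:10, size = n_obs, prob = 1 / n_obs)

# truncating the weights at 10 discards almost no probability mass
sum(probs)  # 1 - sum(probs) is smaller than 1e-7

set.seed(42)
w <- sample(0:10, size = n_obs, replace = TRUE, prob = probs)
mean(w)     # close to 1, the expected bootstrap weight
```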
I'm not sure if putting this arithmetic into an algorithm pseudocode is proper or not. Maybe we can just write:
$$ probs \leftarrow prob\_binomial(x = seq(0, 10), n\_trial = n\_obs, prob = 1 / n\_obs)$$
$$ w \leftarrow sample(from = seq(0, 10), size = n\_obs, replace = TRUE, probs = probs) $$
Sorry I don't have better ideas now...
Oh right, probs in my idea was off because it didn't include the probability of not being sampled. dbinom works very well. How about this?
$probs \leftarrow dbinom(seq(0, 10), size = n_{obs}, prob = 1 / n_{obs})$
$w \leftarrow sample(seq(0, 10), size = n_{obs}, replace = TRUE, probs = probs) $
It looks great! My only concern is that $dbinom(...)$ is R-specific syntax, and I'm not sure non-R users will understand what it means, which is why I used the word $prob\_binomial(...)$ in my previous comment to make it more informative. If you think it's not an issue, you can go ahead and use $dbinom$.
Another option is
$$w \leftarrow sample\_binomial(n_{obs}, size = n_{obs}, probs = 1/n_{obs})$$
corresponding to the R command:
w <- rbinom(n_obs, size = n_obs, prob = 1/n_obs)
This way saves the line defining $probs \leftarrow (...)$ and avoids the upper bound of 10.
Best regards,
I love this!
I would like to re-run this analysis with the xgboost predictions working as intended and then update the pseudocode. It will take a little while since the analysis requires a lot of computing time, but when it's done, may I include an acknowledgment to you in the paper for your help?
Of course, it's my pleasure! I'm still a first-year PhD student with limited insight, so communicating with you about this project has been a valuable experience for me. I use the name Dai-Rong Tsai in academic articles, so you could mention me in the paper with it. I've added my email to my profile intro; feel free to contact me whenever needed.
Dear Maintainer,

I'm trying to train an xgboost model with the Cox proportional hazards loss using your code. You estimated the baseline cumulative hazard $H_0(t)$ by gbm::basehaz.gbm.

https://github.com/bcjaeger/aorsf-bench/blob/c8a9b5b6fd1caa078c50bad52664c5cb993b1268/R/model_fit.R#L397-L406

I searched its documentation and see that the parameter f.x needs to be the predicted values of the regression model on the log hazard scale. However, you passed the output of predict(fit, newdata = xmat) into it, and the predict method for class 'xgb.Booster' defaults to returning predictions on the hazard ratio scale (i.e., as $HR = exp(X\beta)$ in the proportional hazards function $h(t) = h_0(t) * HR$), not the log hazard scale $log(HR) = X\beta$. I think you may need to set outputmargin = TRUE before passing the predictions into basehaz.gbm; outputmargin controls whether the predictions are returned as the original, untransformed sum of predictions from the boosting iterations.

The same issue appears in the xgb_cox_pred function from model_pred.R.

https://github.com/bcjaeger/aorsf-bench/blob/c8a9b5b6fd1caa078c50bad52664c5cb993b1268/R/model_pred.R#L209-L213

It seems that you use the formula $S(t) = exp(exp(X\beta) \cdot (-H_0(t)))$ to evaluate survival probabilities. The lin_preds you define is already $exp(X\beta)$, and you take one more exp() of it, so I think you also need to set outputmargin = TRUE here.

Maybe all of the above is my misunderstanding! Thank you for any replies. This project is outstanding work!
Best regards,