dmlc / XGBoost.jl

XGBoost Julia Package

`predict` overwrites previously returned predictions #187

Closed. StevenWhitaker closed this issue 11 months ago.

StevenWhitaker commented 11 months ago

I don't know if this is a bug or a feature, or if I'm doing something wrong, but `predict` overwrites the results that a previous call to `predict` returned.

MWE:

julia> using DataFrames, XGBoost

julia> dtrain = DataFrame(x = rand(10));

julia> dtest = DataFrame(x = rand(5));

julia> Xytrain = DMatrix(dtrain; label_lower_bound = rand(10), label_upper_bound = rand(10));

julia> Xytest = DMatrix(dtest; label_lower_bound = rand(5), label_upper_bound = rand(5));

julia> bst = xgboost(Xytrain; objective = "survival:aft", eval_metric = "aft-nloglik")
[ Info: XGBoost: starting training.
[ Info: [1]     train-aft-nloglik:17.16976268540767236
[ Info: [2]     train-aft-nloglik:17.17062846556910216
[ Info: [3]     train-aft-nloglik:17.17126957073342908
[ Info: [4]     train-aft-nloglik:17.17210819966958724
[ Info: [5]     train-aft-nloglik:17.17265881877984413
[ Info: [6]     train-aft-nloglik:17.17319223634340020
[ Info: [7]     train-aft-nloglik:17.17374498113908032
[ Info: [8]     train-aft-nloglik:17.17410712376969784
[ Info: [9]     train-aft-nloglik:17.17450418489510611
[ Info: [10]    train-aft-nloglik:17.17485112239023692
[ Info: Training rounds complete.
Booster()

julia> pred = predict(bst, Xytrain)
10-element Vector{Float32}:
 0.49251196
 0.5615733
 0.3805378
 0.46858168
 0.5392939
 0.51840335
 0.3805378
 0.51840335
 0.5615733
 0.49329242

julia> predict(bst, Xytest)
5-element Vector{Float32}:
 0.5392939
 0.3805378
 0.51840335
 0.5615733
 0.5615733

julia> pred
10-element Vector{Float32}:
 0.5392939
 0.3805378
 0.51840335
 0.5615733
 0.5615733
 0.51840335
 0.3805378
 0.51840335
 0.5615733
 0.49329242

In particular, note that the first 5 elements of `pred` were overwritten by the second call to `predict`. I don't know whether this behavior is specific to the objective and/or eval metric I used or whether it is more general.

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900HX
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 32 on 32 virtual cores
Environment:
  JULIA_NUM_THREADS = 32
  JULIA_PKG_USE_CLI_GIT = true

(tmp) pkg> st
Status `/mnt/c/Users/steve/Documents/tmp/Project.toml`
  [336ed68f] CSV v0.10.11
  [1b08a953] Dash v1.3.0
  [1b08a953] DashBootstrapComponents v1.4.1
  [a93c6f00] DataFrames v1.6.1
  [a03496cd] PlotlyBase v0.8.19
  [009559a3] XGBoost v2.3.1
  [2a0f44e3] Base64
  [56ddb016] Logging
  [9a3f8284] Random
  [10745b16] Statistics v1.9.0
ExpandingMan commented 11 months ago

Yikes, that must have been confusing. libxgboost returns a pointer to us on `predict`, and apparently it reuses the same memory location across calls, which is clearly not something we considered when writing this wrapper.

The only way around this, unfortunately, is to copy the entire output array, but it seems pretty obvious that copying should be the default behavior, and it's implemented in #188.
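
For anyone curious why this happens, here is a minimal, self-contained sketch of the aliasing (illustrative only, not the wrapper's actual code). It assumes only that the wrapper hands back views into libxgboost's internal buffer via something like `unsafe_wrap` without copying; `buf` below is a stand-in for that buffer, which must stay alive while the views are used.

julia> buf = zeros(Float32, 5);                  # stand-in for libxgboost's internal prediction buffer

julia> a = unsafe_wrap(Array, pointer(buf), 5);  # what the first `predict` effectively returned

julia> b = unsafe_wrap(Array, pointer(buf), 5);  # a second `predict` hands back the same memory

julia> b .= 1f0;                                 # writing the "new" predictions...

julia> a                                         # ...clobbers the "old" ones too
5-element Vector{Float32}:
 1.0
 1.0
 1.0
 1.0
 1.0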

In that PR I have added the (unexported) function `predict_nocopy` for users who don't want the extra allocation.
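
A hedged usage sketch based on the description above (the exact signature is whatever #188 merges): since `predict_nocopy` is unexported, it is called with module qualification, and its output should be treated as scratch memory that the next prediction may overwrite.

julia> pred = XGBoost.predict_nocopy(bst, Xytest);  # aliases libxgboost's buffer; skips the copy

julia> keep = copy(pred);                           # copy explicitly if it must survive later calls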

StevenWhitaker commented 11 months ago

Awesome, thanks for the quick fix! And your fix is exactly my workaround: just copy.
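
In case it helps anyone on an affected version, that workaround is just:

julia> pred = copy(predict(bst, Xytrain));  # own the memory so later `predict` calls can't clobber it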

It did take me a good hour or so to figure out why my results looked funny, but I'm glad I was able to track it down and hopefully prevent a headache for someone else down the line.