dmlc / XGBoost.jl

XGBoost Julia Package

consideration for new feature to store watchlist information #145

Closed bobaronoff closed 1 year ago

bobaronoff commented 1 year ago

I have written a function to perform cross validation. It works just fine, but I can't help thinking performance would be better if one could reuse the evaluation data that xgboost already reports in the course of booster creation. This data is available in the R and Python implementations. Instead, I resort to calling 'predict' many times and calculating the metrics myself; it works, but at a performance cost.

Inspecting the code in booster.jl, I see a sequence of calls: xgboost -> update! -> updateone! -> logeval -> evaliter -> XGBoosterEvalOneIter

A potential target for implementation might be logeval(). This function routes the output of evaliter() to @info. If it were expanded, the numbers could be parsed out of the evaliter() results, stored, and returned with the booster. It could also be a place to add a 'silent' parameter that suppresses the routing to @info, allowing evaluations to be collected without filling up the REPL.
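To make the parsing idea concrete: the string that evaliter()/XGBoosterEvalOneIter produces has a regular shape, so extracting the numbers is a few lines. A sketch (the function name parse_eval_log and the exact log format are my assumptions, not the package's API):

```julia
# Hypothetical sketch: parse an evaluation line such as
#   "[1]\ttrain-rmse:0.12345\ttest-rmse:0.23456"
# into a Dict of metric name => value, so logeval could store the
# numbers instead of (or in addition to) routing them to @info.
function parse_eval_log(msg::AbstractString)
    metrics = Dict{String,Float64}()
    for field in split(msg, '\t')
        startswith(field, '[') && continue   # skip the "[round]" prefix
        name, val = split(field, ':')
        metrics[name] = parse(Float64, val)
    end
    return metrics
end

m = parse_eval_log("[1]\ttrain-rmse:0.12345\ttest-rmse:0.23456")
# m["train-rmse"] == 0.12345, m["test-rmse"] == 0.23456
```

Collecting one such Dict per round would give exactly the per-iteration history the R and Python wrappers expose.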

Food for thought. I defer to package maintainers as to feasibility/advisability.

PS: aspects of watchlist are a point of confusion for me. I understand the substrate of the train metrics, but what is the substrate for the test metrics? Examples I've seen put a test DMatrix in the watchlist, but I don't understand how it is passed to libxgboost. What data is being used for the test metrics?
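For what it's worth, my (hedged) reading of the wrapper is that every DMatrix named in the watchlist is scored in its entirety against the current model each round via XGBoosterEvalOneIter; nothing is held out automatically, so the "test" metrics come from exactly the data you supply. A sketch, assuming the watchlist keyword of XGBoost.jl's xgboost and synthetic data:

```julia
using XGBoost, Random

Random.seed!(1)
Xtrain, ytrain = randn(100, 4), randn(100)
Xtest,  ytest  = randn(50, 4),  randn(50)

dtrain = DMatrix(Xtrain, ytrain)
dtest  = DMatrix(Xtest, ytest)

# Each named matrix here is evaluated in full every round; "test-rmse"
# is computed on dtest exactly as supplied (no out-of-bag sampling).
bst = xgboost(dtrain;
    num_round = 5,
    watchlist = Dict("train" => dtrain, "test" => dtest),
)
```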

ExpandingMan commented 1 year ago

I'd be open to this, especially since it is already a feature in the Python and R wrappers.

Note that watchlist is handled internally by libxgboost, so I don't know what it's doing in detail, but I assume it's straightforward. I agree that its opaqueness is annoying.

What you are suggesting would require re-implementing watchlist, which I don't necessarily think would be particularly difficult. I'm not sure how general it should be. I made the update! methods public partially with the intention of eliminating the need for callbacks, though perhaps adding a simple callback keyword argument would be a big enough simplification of the UI to be worth adding. Recording evaluation data could then be a specific callback.

ExpandingMan commented 1 year ago

By the way, I just want to make it abundantly clear that you can do what you are suggesting without too much trouble: simply populate your own data structure between calls to updateone!, e.g. something like

dm = DMatrix(X, y)
b = Booster(dm)

v = []  # predictions recorded after each boosting round
for j ∈ 1:n
    updateone!(b, dm)          # add one boosting round
    push!(v, predict(b, dm))   # record this round's predictions
end

will store data after each iteration.

I think we're basically talking about shortening that to something like

xgboost(dm; num_round=n, callback=(b -> push!(v, predict(b, dm))))

which seems like kind of a small improvement in return for adding a new feature, but, again, I'm not necessarily against it.

bobaronoff commented 1 year ago

I see your points. Perhaps best to put this request on the back burner. I am doing something similar to your suggestion. I grow the booster all at once rather than one round at a time. I take advantage of the ntree_limit parameter of predict to obtain predictions along the creation path. The only advantage I see was to utilize libxgboost to convert the predictions in to an evaluation metric (i.e. rmse, logloss, mlogloss, etc). Perhaps I am overthinking. Julia executes briskly and calculating the metrics in Julia may not be such difference in performance - though libxgboost has lots of options. For purpose of CV curves, I would need to better understand how watchlist creates its test evaluation metric. If it's based on the out of bag rows from the subsample then it won't due for cross validation. If it's simply a second metric different than loss function - again , not helpful. Will close this item for now. Thank you so much so your feedback.