dmlc / XGBoost.jl

XGBoost Julia Package
Other
288 stars 110 forks source link

TLDR Example Fails #204

Open mahiki opened 1 month ago

mahiki commented 1 month ago

I'm new to all the packages referenced here, sorry I can't diagnose.

Environment:

julia version 1.10.4

cat Project.toml
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Term = "22787eb5-b846-44ae-b979-8e399b8463ab"
XGBoost = "009559a3-9522-5dbb-924b-0b6ed2b22bb9"

pkg> st
Status ...etc
  [a93c6f00] DataFrames v1.6.1
  [22787eb5] Term v2.0.6
  [009559a3] XGBoost v2.5.1

I tried out the TLDR example in the docs:

using DataFrames, XGBoost, Term

# training set of 100 datapoints of 4 features
(X, y) = (randn(100,4), randn(100))

# create and train a gradient boosted tree model of 5 trees
bst = xgboost((X, y), num_round=5, max_depth=6, objective="reg:squarederror")

# obtain model predictions
ŷ = predict(bst, X)

df = DataFrame(randn(100,3), [:a, :b, :y])

# can accept tabular data, will keep feature names
bst = xgboost((df[!, [:a, :b]], df.y))

# display importance statistics retaining feature names
importancereport(bst)

╭───────────┬────────────┬──────────┬───────────┬──────────────┬───────────────╮
│  feature  │    gain    │  weight  │   cover   │  total_gain  │  total_cover  │
├───────────┼────────────┼──────────┼───────────┼──────────────┼───────────────┤
│    "b"    │  1.16553   │   82.0   │  34.7073  │   95.5735    │    2846.0     │
├───────────┼────────────┼──────────┼───────────┼──────────────┼───────────────┤
│    "a"    │  0.918156  │  108.0   │  26.463   │   99.1608    │    2858.0     │
╰───────────┴────────────┴──────────┴───────────┴──────────────┴───────────────╯

So far so good

trees(bst)

#... stack trace in next comment
mahiki commented 1 month ago
julia> trees(bst)
10-element Vector{XGBoost.Node}:
 Error showing value of type Vector{XGBoost.Node}:
ERROR: MethodError: no method matching print_tree(::typeof(Term.Trees.print_node), ::typeof(Term.Trees.print_key), ::IOBuffer, ::OrderedCollections.OrderedDict{…}; charset::AbstractTrees.TreeCharSet, printkeys::Bool, prefix::String, title_style::String)

Closest candidates are:
  print_tree(::Function, ::Function, ::IO, ::Any; maxdepth, indicate_truncation, charset, printkeys, depth, prefix, printnode_kw) got unsupported keyword argument "title_style"
   @ AbstractTrees ~/.julia/packages/AbstractTrees/Ftf8W/src/printing.jl:194
  print_tree(::Function, ::IO, ::Any; kw...)
   @ AbstractTrees ~/.julia/packages/AbstractTrees/Ftf8W/src/printing.jl:276
  print_tree(::Any; kw...)
   @ AbstractTrees ~/.julia/packages/AbstractTrees/Ftf8W/src/printing.jl:278
  ...

Stacktrace:
  [1] kwerr(::@NamedTuple{…}, ::Function, ::Function, ::Function, ::IOBuffer, ::OrderedCollections.OrderedDict{…})
    @ Base ./error.jl:165
  [2] (::Term.Trees.var"#4#5"{AbstractTrees.TreeCharSet, Bool, typeof(Term.Trees.print_node), typeof(Term.Trees.print_key), String, @Kwargs{title_style::String}})(io::IOBuffer)
    @ Term.Trees ~/.julia/packages/Term/u4VQK/src/trees.jl:147
  [3] sprint(::Function; context::Nothing, sizehint::Int64)
    @ Base ./strings/io.jl:114
  [4] sprint
    @ ./strings/io.jl:107 [inlined]
  [5] Term.Trees.Tree(tree::OrderedCollections.OrderedDict{…}; guides::Symbol, theme::Term.Theme, printkeys::Bool, print_node_function::Function, print_key_function::Function, title::String, prefix::String, kwargs::@Kwargs{…})
    @ Term.Trees ~/.julia/packages/Term/u4VQK/src/trees.jl:146
  [6] Term.Trees.Tree(node::XGBoost.Node; title::String, title_style::String, kwargs::@Kwargs{})
    @ XGBoostTermExt ~/.julia/packages/XGBoost/nqMqQ/ext/XGBoostTermExt.jl:71
  [7] Tree
    @ ~/.julia/packages/XGBoost/nqMqQ/ext/XGBoostTermExt.jl:65 [inlined]
  [8] Panel(node::XGBoost.Node; width::Int64, kw::@Kwargs{})
    @ XGBoostTermExt ~/.julia/packages/XGBoost/nqMqQ/ext/XGBoostTermExt.jl:81
  [9] Panel
    @ ~/.julia/packages/XGBoost/nqMqQ/ext/XGBoostTermExt.jl:74 [inlined]
 [10] show
    @ ~/.julia/packages/XGBoost/nqMqQ/ext/XGBoostTermExt.jl:93 [inlined]
 [11] show(io::IOContext{IOBuffer}, m::String, x::XGBoost.Node)
    @ Base.Multimedia ./multimedia.jl:123
 [12] sprint(::Function, ::String, ::Vararg{Any}; context::IOContext{Base.TTY}, sizehint::Int64)
    @ Base ./strings/io.jl:112
 [13] sprint
    @ ./strings/io.jl:107 [inlined]
 [14] print_matrix_row(io::IOContext{Base.TTY}, X::AbstractVecOrMat, A::Vector{Tuple{Int64, Int64}}, i::Int64, cols::Vector{Int64}, sep::String, idxlast::Int64)
    @ Base ./arrayshow.jl:108
 [15] _print_matrix(io::IOContext{…}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{…}, colsA::UnitRange{…})
    @ Base ./arrayshow.jl:213
 [16] print_matrix(io::IOContext{Base.TTY}, X::Vector{XGBoost.Node}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64)
    @ Base ./arrayshow.jl:171
 [17] print_matrix
    @ ./arrayshow.jl:171 [inlined]
 [18] print_array
    @ ./arrayshow.jl:358 [inlined]
 [19] show(io::IOContext{Base.TTY}, ::MIME{Symbol("text/plain")}, X::Vector{XGBoost.Node})
    @ Base ./arrayshow.jl:399
 [20] (::REPL.var"#55#56"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:273
 [21] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:569
 [22] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:259
 [23] display
    @ ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:278 [inlined]
 [24] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [25] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:0
 [26] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:284
 [27] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:569
 [28] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:282
 [29] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:911
 [30] #invokelatest#2
    @ ./essentials.jl:892 [inlined]
 [31] invokelatest
    @ ./essentials.jl:889 [inlined]
 [32] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2656
 [33] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1312
 [34] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ~/.julia/juliaup/julia-1.10.4+0.aarch64.apple.darwin14/share/julia/stdlib/v1.10/REPL/src/REPL.jl:386
Some type information was truncated. Use `show(err)` to see complete types.
mahiki commented 1 month ago

Looks like failure from Term.jl dependency

Note that tree(bst) from the TLDR succeeds if I don't import Term:

using DataFrames, XGBoost
    # ...etc

# expected
julia> importancereport(bst)
ERROR: MethodError: no method matching importancereport(::Booster)

julia> importance(bst)
    # OrderedCollections.OrderedDict{String, Vector{Float32}} with 2 entries:
    # "a" => [0.709936]
    # "b" => [0.644888]

julia> trees(bst)
    # 10-element Vector{XGBoost.Node}:
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=a)
    # XGBoost.Node(split_feature=b)
    # XGBoost.Node(split_feature=a)

julia> for i in propertynames(ts[1])
           @printf "%-20s    %12s\n" i getproperty(ts[1], i)
       end
id                                 0
depth                              0
split                              b
split_condition           1.22669697
yes                                1
no                                 2
nmissing                           2
gain                      4.72399569
cover                          100.0
leaf                         nothing
children                XGBoost.Node[XGBoost.Node(split_feature=a), XGBoost.Node(split_feature=b)]
mahiki commented 1 month ago

It really dinged my confidence when the Hello World for this package failed, and I messed around with versions trying to find an environment where it works -- wasted time.

I suggest either:

Or maybe someone experienced can see an easy fix! Let me know I'm happy to PR either of the above.

ExpandingMan commented 1 month ago

This is either an error in Term.jl in that it's passing keyword arguments to print_tree that it shouldn't, or an error in XGBoostTermExt in that it is creating a Term.Tree with arguments that can't be passed into print_tree. Either way, the Term tree output doesn't seem to be making a lot of sense now even when this is fixed by removing those arguments.

We probably should remove the term tree visualizations as that doesn't really seem like something we'll want to maintain and it was perhaps a bit overenthusiastic to put it there in the first place.

mahiki commented 1 month ago

Makes sense, something fragile about this dependency reminiscent of a 🔥 julia discourse not long ago.

Is tree meant to be a 'show' type of function, ancillary to the model in production?

Visualization features don't need to be in scope, although the explainability of tree models is one of the big benefits. Package users can manage that outside of this package, maybe just a good how-to example in the docs is a good tradeoff.

I'm eager to help on this, esp. having been useless in helping on Parquet2.jl. I expect to be using this package a lot.

Edit: a little guidance on general direction to go and I'd be able to make a PR

ExpandingMan commented 1 month ago

This wrapper defines a Node object that's supposed to aid users with introspection of the trees, as the library itself provides text output. Visualization of that kind of stuff is admittedly a hard problem. If we cut down on the visualization stuff, users should still be able to find out whatever they need with Node, it just may be a little more work.

If you want to remove all of the Node methods from the Term extension, I personally would be happy with that and would be willing to merge it. I can't speak for anyone else, so if you do make such a PR I'll want to leave it up for a few days to give people a chance to comment.

Other solutions are also welcome, but yeah I think in retrospect adding this kind of fluff to this simple wrapper package was not the greatest idea. Node should definitely stay since introspection of the trees is a pain and the help is badly needed, but trying to get the whole thing to display nicely in the general case is a big complicated problem that we shouldn't try to solve.

There is also the fraught issue of whether it would be a breaking change. I always think display stuff should not be considered part of the API and therefore not breaking, but we don't make it clear, it might have to be, so I guess it just is what it is.

ablaom commented 1 month ago

I always think display stuff should not be considered part of the API and therefore not breaking

I vote for non-breaking.

mahiki commented 1 month ago

Sorry I've been reading through the package and just learned about package extensions.

Tell me if I have this right:

So I'd say

  1. getting a stack trace on the display call in REPL is not great, maybe add error handling and fall back to Base.show?
  2. Maybe this could be fixed with a small tweak to XGBoostTermExt? That would be preferable I think.

First thing I'd like to do is add a hint to the MethodError for instancereport following this example, that would have helped me a lot as a first time user.

ExpandingMan commented 1 month ago

I think you have the first part of it right, but not the cause of the problem. The method error happens because Term.jl is passing the keyword arguments to its constructors to print_tree which does not in turn accept them. The method error was entirely appropriate, given the circumstances.

Of course it's possible to fix this but I'm not even sure what the intended behavior is, so I'm not sure whether this is a case of Term.jl misusing print_tree or a case of XGBoostTermExt misusing the Term.jl tree constructor. Either way, I'm inclined to agree with what I thought was the sentiment of your earlier post, i.e. displaying these trees nicely is a rather hard problem to solve in general and we shouldn't try to do it here. Looking at this a bit more broadly, we don't even have any real confidence that these trees will display nicely in a wide variety of cases even were it not for this error.

That said, whether you are inclined to fix this issue and restore the original behavior, or just remove the Term display for tree nodes, I'd vote to merge either case.