JuliaAI / CatBoost.jl

Julia wrapper of the Python library CatBoost for boosted decision trees
MIT License

Save model fails #34

Closed · liuyxpp closed this 2 weeks ago

liuyxpp commented 3 months ago

Julia: v1.10.2 CatBoost: v0.3.4

MWE:

using CatBoost.MLJCatBoostInterface
using DataFrames
using MLJBase

# Initialize data
train_data = DataFrame([[1, 4, 30], [4, 5, 40], [5, 6, 50], [6, 7, 60]], :auto)
train_labels = [10.0, 20.0, 30.0]
eval_data = DataFrame([[2, 1], [4, 4], [6, 50], [8, 60]], :auto)

# Initialize CatBoostRegressor
model = CatBoostRegressor(; iterations=2, learning_rate=1.0, depth=2)
mach = machine(model, train_data, train_labels)

# Fit model
MLJBase.fit!(mach)

# Get predictions
preds_class = MLJBase.predict(mach, eval_data)

# Save the trained model
MLJBase.save("catboost.jls", mach)

The last line failed with:

Python: TypeError: 'CatBoostRegressor' object is not iterable

Stacktrace:
  [1] pythrow()
    @ PythonCall ~/.julia/packages/PythonCall/wXfah/src/err.jl:94
  [2] errcheck
    @ ~/.julia/packages/PythonCall/wXfah/src/err.jl:10 [inlined]
  [3] pyiter
    @ ~/.julia/packages/PythonCall/wXfah/src/abstract/iter.jl:6 [inlined]
  [4] iterate
    @ ~/.julia/packages/PythonCall/wXfah/src/Py.jl:328 [inlined]
  [5] indexed_iterate(I::PythonCall.Py, i::Int64)
    @ Base ./tuple.jl:95
  [6] #save#13
    @ ~/.julia/packages/CatBoost/TiqIz/src/mlj_serialization.jl:44 [inlined]
  [7] save(::CatBoostRegressor, fr::PythonCall.Py)
    @ CatBoost.MLJCatBoostInterface ~/.julia/packages/CatBoost/TiqIz/src/mlj_serialization.jl:43
  [8] serializable(mach::Machine{CatBoostRegressor, true}, model::CatBoostRegressor; verbosity::Int64)
    @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:988
  [9] serializable (repeats 2 times)
    @ ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:978 [inlined]
 [10] save(file::String, mach::Machine{CatBoostRegressor, true})
    @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:1082
 [11] top-level scope
    @ In[23]:1

The error seems to come from the following line in mlj_serialization.jl:

 (booster, a_target_element) = fr

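The stacktrace is consistent with the regressor's `fitresult` being a single Python `CatBoostRegressor` object rather than an iterable pair, so the two-element destructuring forces PythonCall to call `pyiter` on it, which raises `TypeError: ... not iterable`. A minimal sketch of one way to avoid this, dispatching on the model type instead of destructuring unconditionally (the function and helper names here are illustrative, not the package's actual internals):

```julia
# Sketch only. `fr` is the fitresult: for a regressor it is assumed to be
# the bare Python booster; for a classifier, a (booster, target_element)
# tuple, which is what the current destructuring expects.
function save_fitresult(model::CatBoostRegressor, fr)
    booster = fr                        # nothing to unpack for a regressor
    return serialize_booster(booster)   # hypothetical serialization helper
end

function save_fitresult(model::CatBoostClassifier, fr)
    booster, a_target_element = fr      # tuple destructuring is safe here
    return (serialize_booster(booster), a_target_element)
end
```

With this split, the regressor path never iterates the Python object, while the classifier path keeps its existing behavior.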
ablaom commented 3 months ago

This is very likely fixed if you have the very latest version of MLJFlow.jl (0.4.1).

Okay, maybe not; you're not using MLJ. I'll take a look soon.

tylerjthomas9 commented 3 months ago
julia> using CatBoost.MLJCatBoostInterface

julia> using DataFrames

julia> using MLJBase

       # Initialize data

julia> train_data = DataFrame([[1, 4, 30], [4, 5, 40], [5, 6, 50], [6, 7, 60]], :auto)
3×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      5      6
   2 │     4      5      6      7
   3 │    30     40     50     60

julia> train_labels = [10.0, 20.0, 30.0]
3-element Vector{Float64}:
 10.0
 20.0
 30.0

julia> eval_data = DataFrame([[2, 1], [4, 4], [6, 50], [8, 60]], :auto)

       # Initialize CatBoostRegressor
2×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     2      4      6      8
   2 │     1      4     50     60

julia> model = CatBoostRegressor(; iterations=2, learning_rate=1.0, depth=2)
CatBoostRegressor(
  iterations = 2,
  learning_rate = 1.0,
  depth = 2,
  l2_leaf_reg = 3.0,
  model_size_reg = 0.5,
  rsm = 1.0,
  loss_function = "RMSE",
  border_count = nothing,
  feature_border_type = nothing,
  per_float_feature_quantization = nothing,
  input_borders = nothing,
  output_borders = nothing,
  fold_permutation_block = 1,
  nan_mode = "Min",
  counter_calc_method = "SkipTest",
  leaf_estimation_iterations = nothing,
  leaf_estimation_method = nothing,
  thread_count = -1,
  random_seed = nothing,
  metric_period = 1,
  ctr_leaf_count_limit = nothing,
  store_all_simple_ctr = false,
  max_ctr_complexity = nothing,
  has_time = false,
  allow_const_label = false,
  target_border = nothing,
  one_hot_max_size = nothing,
  random_strength = 1.0,
  custom_metric = nothing,
  bagging_temperature = 1.0,
  fold_len_multiplier = 2.0,
  used_ram_limit = nothing,
  gpu_ram_part = 0.95,
  pinned_memory_size = 1073741824,
  allow_writing_files = nothing,
  approx_on_full_history = false,
  boosting_type = nothing,
  simple_ctr = nothing,
  combinations_ctr = nothing,
  per_feature_ctr = nothing,
  ctr_target_border_count = nothing,
  task_type = nothing,
  devices = nothing,
  bootstrap_type = nothing,
  subsample = nothing,
  sampling_frequency = "PerTreeLevel",
  sampling_unit = "Object",
  gpu_cat_features_storage = "GpuRam",
  data_partition = nothing,
  early_stopping_rounds = nothing,
  grow_policy = "SymmetricTree",
  min_data_in_leaf = 1,
  max_leaves = 31,
  leaf_estimation_backtracking = "AnyImprovement",
  feature_weights = nothing,
  penalties_coefficient = 1.0,
  model_shrink_rate = nothing,
  model_shrink_mode = "Constant",
  langevin = false,
  diffusion_temperature = 10000.0,
  posterior_sampling = false,
  boost_from_average = nothing,
  text_processing = nothing)

julia> mach = machine(model, train_data, train_labels)

       # Fit model
untrained Machine; caches model-specific representations of data
  model: CatBoostRegressor(iterations = 2, …)
  args:
    1:  Source @487 ⏎ Table{AbstractVector{Count}}
    2:  Source @087 ⏎ AbstractVector{Continuous}

julia> MLJBase.fit!(mach)

       # Get predictions
[ Info: Training machine(CatBoostRegressor(iterations = 2, …), …).
trained Machine; caches model-specific representations of data
  model: CatBoostRegressor(iterations = 2, …)
  args:
    1:  Source @487 ⏎ Table{AbstractVector{Count}}
    2:  Source @087 ⏎ AbstractVector{Continuous}

julia> preds_class = MLJBase.predict(mach, eval_data)
2-element Vector{Float64}:
 15.625
 18.125

julia> serializable_fitresult = MLJBase.save(mach, mach.fitresult)
Python: <catboost.core.CatBoostRegressor object at 0x7e27fff9e1e0>

julia> restored_fitresult = MLJBase.restore(mach, serializable_fitresult)
Python: <catboost.core.CatBoostRegressor object at 0x7e27fff9e1e0>

The MMI.save and MMI.restore functions work with the Machine's fitresult, as shown above. I can look at adding support for serializing the entire Machine. It was set up like this because we have to use catboost's Python interface to save/load the models.
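Until whole-machine serialization is supported, a possible interim workaround is to go through catboost's own Python serialization, which PythonCall can call directly. This is an untested sketch that assumes the object returned by `MLJBase.save(mach, mach.fitresult)` is the underlying Python `CatBoostRegressor`, as the session above suggests:

```julia
using MLJBase
using PythonCall

# `mach` is a fitted Machine, as in the MWE above.
# Extract the serializable fitresult (a Python CatBoostRegressor).
booster = MLJBase.save(mach, mach.fitresult)

# Save via catboost's native Python API, in catboost's own .cbm format.
booster.save_model("catboost.cbm")

# Later: construct a fresh Python CatBoostRegressor and load the weights.
catboost = pyimport("catboost")
restored = catboost.CatBoostRegressor().load_model("catboost.cbm")
```

The restored object is a Python model, so it would still need to be wired back into an MLJ machine (e.g. via `MLJBase.restore`) before calling `MLJBase.predict` on it.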

ablaom commented 3 months ago

In case it's relevant: https://juliaai.github.io/MLJModelInterface.jl/dev/serialization/