JuliaAI / MLJTuning.jl

Hyperparameter optimization algorithms for use in the MLJ machine learning framework
MIT License

Adding loggers into TunedModels #193

Closed pebeto closed 2 months ago

pebeto commented 10 months ago

Details in alan-turing-institute/MLJ.jl#1029.

codecov[bot] commented 10 months ago

Codecov Report

Attention: Patch coverage is 90.00000%, with 1 line in your changes missing coverage. Please review.

Project coverage is 87.55%. Comparing base (bb59cae) to head (2b63fa8).

| Files | Patch % | Lines |
|---|---|---|
| src/tuned_models.jl | 90.00% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##              dev     #193      +/-   ##
==========================================
+ Coverage   87.53%   87.55%   +0.01%
==========================================
  Files          13       13
  Lines         666      667       +1
==========================================
+ Hits          583      584       +1
  Misses         83       83
```

:umbrella: View full report in Codecov by Sentry.

ablaom commented 10 months ago

Looking good, thanks!

Does it all look good on the MLflow service when fitting a TunedModel(model, logger=MLFlowLogger(...), ...)?

pebeto commented 10 months ago

Looking good locally. I've just uploaded the TunedModel test case JuliaAI/MLJFlow.jl@2153b693ba2dcfb09399ea43614485bbef6d3146

ablaom commented 10 months ago

Played around with this some more. Very cool, thanks!

However, there is a problem running in multithreaded mode. It seems only one thread is logging:

using MLJ
using .Threads
using MLFlowClient
nthreads()
# 5

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="horse")
X, y = make_moons()
model = (@load RandomForestClassifier pkg=DecisionTree)()

r = range(model, :sampling_fraction, lower=0.4, upper=1.0)

tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUThreads(),
    n=100,
)

mach = machine(tmodel, X, y) |> fit!;
nruns = length(report(mach).history)
# 100

service = MLJFlow.service(logger)
experiment = MLFlowClient.getexperiment(service, "horse")
id = experiment.experiment_id
runs = MLFlowClient.searchruns(service, id);
length(runs)
# 20

@assert length(runs) == nruns
# ERROR: AssertionError: length(runs) == nruns
# Stacktrace:
#  [1] top-level scope
#    @ REPL[166]:1
ablaom commented 10 months ago

The problem is that we are missing `logger` in the cloning of the resampling machine that happens here:

https://github.com/pebeto/MLJTuning.jl/blob/6f295b7439a9884fa35c16841ded33db2d272227/src/tuned_models.jl#L590
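For context, the kind of fix implied is sketched below (hypothetical keyword list; the actual constructor call is at the link above). If the per-thread clone of the `Resampler` omits the `logger` keyword, every thread except the one holding the original machine silently skips logging:

```julia
# Sketch only, not the actual MLJTuning source. When each thread gets its
# own copy of the resampling machine, every field of the wrapped Resampler
# must be carried over -- including `logger`:
clone = machine(
    Resampler(
        model = model,
        resampling = resampling,
        measure = measure,
        logger = logger,  # <- the keyword missing from the clone
    ),
    X, y,
)
```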

ablaom commented 10 months ago

I think CPUProcesses should be fine, but we should add a test for this at MLJFlow.jl (and for CPUThreads).

ablaom commented 10 months ago

Thanks for the addition. Sadly, this is still not working for me. I'm getting three experiments on the server, with different ids but the same name, "horse" (I'm only expecting one). One contains 20 evaluations, the other two contain only one each, and this complaint is thrown several times:

    {"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}""")

Do you have any idea what is happening?

```
ERROR: TaskFailedException

    nested task error: HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
    """
    HTTP/1.1 400 Bad Request
    Server: gunicorn
    Date: Sun, 24 Sep 2023 19:40:45 GMT
    Connection: close
    Content-Type: application/json
    Content-Length: 90

    {"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}""")
    Stacktrace:
      [1] mlfpost(mlf::MLFlow, endpoint::String; kwargs::Base.Pairs{Symbol, Union{Missing, Nothing, String}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:name, :artifact_location, :tags), Tuple{String, Nothing, Missing}}})
        @ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
      [2] mlfpost
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
      [3] createexperiment(mlf::MLFlow; name::String, artifact_location::Nothing, tags::Missing)
        @ MLFlowClient ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
      [4] createexperiment
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
      [5] #getorcreateexperiment#7
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
      [6] log_evaluation(logger::MLFlowLogger, performance_evaluation::PerformanceEvaluation{MLJDecisionTreeInterface.RandomForestClassifier, Vector{LogLoss{Float64}}, Vector{Float64}, Vector{typeof(predict)}, Vector{Vector{Float64}}, Vector{Vector{Vector{Float64}}}, Vector{NamedTuple{(:forest,), Tuple{DecisionTree.Ensemble{Float64, UInt32}}}}, Vector{NamedTuple{(:features,), Tuple{Vector{Symbol}}}}, Holdout})
        @ MLJFlow ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
      [7] evaluate!(mach::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, resampling::Vector{Tuple{Vector{Int64}, Vector{Int64}}}, weights::Nothing, class_weights::Nothing, rows::Nothing, verbosity::Int64, repeats::Int64, measures::Vector{LogLoss{Float64}}, operations::Vector{typeof(predict)}, acceleration::CPU1{Nothing}, force::Bool, logger::MLFlowLogger, user_resampling::Holdout)
        @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
      [8] evaluate!(::Machine{MLJDecisionTreeInterface.RandomForestClassifier, true}, ::Holdout, ::Nothing, ::Nothing, ::Nothing, ::Int64, ::Int64, ::Vector{LogLoss{Float64}}, ::Vector{typeof(predict)}, ::CPU1{Nothing}, ::Bool, ::MLFlowLogger, ::Holdout)
        @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
      [9] fit(::Resampler{Holdout, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
        @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
     [10] fit_only!(mach::Machine{Resampler{Holdout, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
        @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
     [11] fit_only!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
     [12] #fit!#63
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
     [13] fit!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
     [14] event!(metamodel::MLJDecisionTreeInterface.RandomForestClassifier, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}})
        @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:443
     [15] #46
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:597 [inlined]
     [16] iterate
        @ ./generator.jl:47 [inlined]
     [17] _collect(c::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, itr::Base.Generator{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, MLJTuning.var"#46#50"{Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, Int64}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
        @ Base ./array.jl:802
     [18] collect_similar
        @ ./array.jl:711 [inlined]
     [19] map
        @ ./abstractarray.jl:3261 [inlined]
     [20] macro expansion
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:596 [inlined]
     [21] (::MLJTuning.var"#45#49"{Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, Channel{Bool}, Vector{Any}, Vector{Machine{Resampler{Holdout, MLFlowLogger}, false}}, UnitRange{Int64}, Int64})()
        @ MLJTuning ./threadingconstructs.jl:373
```
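The failure pattern above is a classic check-then-act race: each thread calls `getorcreateexperiment`, and between one thread's lookup (a miss) and its create, another thread creates the experiment, so the late create is rejected with `RESOURCE_ALREADY_EXISTS`. A toy illustration of the race and a lock-based fix, independent of MLflow (all names here are hypothetical):

```julia
using Base.Threads

const registry = Set{String}()         # stands in for the server's experiment table
const registry_lock = ReentrantLock()

# Unsafe get-or-create: the lookup and the creation are separate steps.
# Two threads can both miss the lookup; the "server" then rejects the
# second creation attempt, just as MLflow does with a 400.
function getorcreate_unsafe(name)
    name in registry && return name                       # "get"
    # <- another thread can create `name` right here
    name in registry && error("RESOURCE_ALREADY_EXISTS")  # server uniqueness check
    push!(registry, name)                                 # "create"
    return name
end

# Safe variant: the whole check-then-act section runs under one lock,
# so only one thread at a time can observe a miss and create.
function getorcreate_safe(name)
    lock(registry_lock) do
        name in registry || push!(registry, name)
        return name
    end
end
```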
ablaom commented 10 months ago

Interestingly, I'm getting the same kind of error with acceleration=CPUProcesses() (distributed mode):

using Distributed
addprocs(2)

nprocs()
# 3

using MLJ
using MLFlowClient
logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="rock")

X, y = make_moons()
model = (@iload RandomForestClassifier pkg=DecisionTree)()

r = range(model, :sampling_fraction, lower=0.4, upper=1.0)

tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUProcesses(),
    n=100,
)

mach = machine(tmodel, X, y) |> fit!;
```
[ Info: Training machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …).
[ Info: Attempting to evaluate 100 models.
      From worker 3:  ┌ Error: Problem fitting the machine machine(Resampler(model = RandomForestClassifier(max_depth = -1, …), …), …).
      From worker 3:  └ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
      From worker 3:  [ Info: Running type checks...
      From worker 3:  [ Info: Type checks okay.
Evaluating over 100 metamodels:  50%[============>            ]  ETA: 0:00:15
┌ Error: Problem fitting the machine machine(ProbabilisticTunedModel(model = RandomForestClassifier(max_depth = -1, …), …), …).
└ @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:682
[ Info: Running type checks...
[ Info: Type checks okay.
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:349 [inlined]
  [2] fetch
    @ ./task.jl:369 [inlined]
  [3] preduce(reducer::Function, f::Function, R::Vector{MLJDecisionTreeInterface.RandomForestClassifier})
    @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:274
  [4] macro expansion
    @ ~/MLJ/MLJTuning/src/tuned_models.jl:521 [inlined]
  [5] macro expansion
    @ ./task.jl:476 [inlined]
  [6] assemble_events!(metamodels::Vector{MLJDecisionTreeInterface.RandomForestClassifier}, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false}, verbosity::Int64, tuning::RandomSearch, history::Nothing, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, acceleration::CPUProcesses{Nothing})
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:502
  [7] build!(history::Nothing, n::Int64, tuning::RandomSearch, model::MLJDecisionTreeInterface.RandomForestClassifier, model_buffer::Channel{Any}, state::Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, verbosity::Int64, acceleration::CPUProcesses{Nothing}, resampling_machine::Machine{Resampler{Holdout, MLFlowLogger}, false})
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:675
  [8] fit(::MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, ::Int64, ::Tables.MatrixTable{Matrix{Float64}}, ::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
    @ MLJTuning ~/MLJ/MLJTuning/src/tuned_models.jl:756
  [9] fit_only!(mach::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
    @ MLJBase ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
 [10] fit_only!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
 [11] #fit!#63
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
 [12] fit!
    @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
 [13] |>(x::Machine{MLJTuning.ProbabilisticTunedModel{RandomSearch, MLJDecisionTreeInterface.RandomForestClassifier, MLFlowLogger}, false}, f::typeof(fit!))
    @ Base ./operators.jl:907
 [14] top-level scope
    @ REPL[16]:1

    nested task error: On worker 3:
    HTTP.Exceptions.StatusError(400, "POST", "/api/2.0/mlflow/experiments/create", HTTP.Messages.Response:
    """
    HTTP/1.1 400 Bad Request
    Server: gunicorn
    Date: Sun, 24 Sep 2023 20:07:23 GMT
    Connection: close
    Content-Type: application/json
    Content-Length: 89

    {"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'rock' already exists."}""")
    Stacktrace:
      [1] #mlfpost#3
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:74
      [2] mlfpost
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/utils.jl:66 [inlined]
      [3] #createexperiment#6
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:21
      [4] createexperiment
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:16 [inlined]
      [5] #getorcreateexperiment#7
        @ ~/.julia/packages/MLFlowClient/Szkbv/src/experiments.jl:103 [inlined]
      [6] log_evaluation
        @ ~/.julia/packages/MLJFlow/TqEtw/src/base.jl:2
      [7] evaluate!
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1314
      [8] evaluate!
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1335
      [9] fit
        @ ~/.julia/packages/MLJBase/ByFwA/src/resampling.jl:1494
     [10] #fit_only!#57
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:680
     [11] fit_only!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:606 [inlined]
     [12] #fit!#63
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:778 [inlined]
     [13] fit!
        @ ~/.julia/packages/MLJBase/ByFwA/src/machines.jl:775 [inlined]
     [14] event!
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:443
     [15] macro expansion
        @ ~/MLJ/MLJTuning/src/tuned_models.jl:522 [inlined]
     [16] #39
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:288
     [17] #invokelatest#2
        @ ./essentials.jl:816
     [18] invokelatest
        @ ./essentials.jl:813
     [19] #110
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
     [20] run_work_thunk
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
     [21] macro expansion
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
     [22] #109
        @ ./task.jl:514

    Stacktrace:
      [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
        @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
      [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Function, ::Vararg{Any})
        @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
      [3] #remotecall_fetch#162
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
      [4] remotecall_fetch
        @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
      [5] (::Distributed.var"#175#176"{typeof(vcat), MLJTuning.var"#39#42"{Machine{Resampler{Holdout, MLFlowLogger}, false}, Int64, RandomSearch, Nothing, Vector{Tuple{Symbol, MLJBase.NumericSampler{Float64, Distributions.Uniform{Float64}, Symbol}}}, RemoteChannel{Channel{Bool}}}, Vector{MLJDecisionTreeInterface.RandomForestClassifier}, Vector{UnitRange{Int64}}, Int64, Int64})()
        @ Distributed /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Distributed/src/macros.jl:270
```
ablaom commented 10 months ago

Okay, see here for a MWE: https://github.com/JuliaAI/MLFlowClient.jl/issues/40

ablaom commented 6 months ago

Revisiting this issue after a few months.

It looks like the multithreading issue is not likely to be addressed soon. Perhaps we can proceed with this PR after strictly ruling out logging in the parallel modes. For example, if logger is different from nothing, and either acceleration or acceleration_resampling is different from CPU1(), then clean! resets the accelerations to CPU1() and issues a message saying what it has done and why. The clean! code is here.
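A rough sketch of such a guard (hypothetical; field names follow the MLJ conventions used above, and the real `clean!` method carries more checks):

```julia
# Sketch of the proposed clean! behaviour: with a logger present,
# downgrade any parallel acceleration to CPU1() and report the change.
function clean!(model)
    message = ""
    if model.logger !== nothing &&
            (model.acceleration != CPU1() || model.acceleration_resampling != CPU1())
        model.acceleration = CPU1()
        model.acceleration_resampling = CPU1()
        message *= "Logging is not supported in parallel modes; " *
            "resetting `acceleration` and `acceleration_resampling` to `CPU1()`. "
    end
    return message
end
```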

@pebeto What do you think?

pebeto commented 4 months ago

A fix on the mlflow side is not planned (see https://github.com/mlflow/mlflow/issues/11122). However, a workaround is presented in https://github.com/JuliaAI/MLJFlow.jl/pull/36 to ensure our process is thread-safe.
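The workaround amounts to serializing access to the MLflow client, along these lines (a sketch of the idea, not the PR's exact code):

```julia
# Sketch: guard each log_evaluation call with a process-wide lock so
# concurrent threads cannot race on get-or-create of the experiment.
const MLFLOW_LOCK = ReentrantLock()

function log_evaluation(logger::MLFlowLogger, performance_evaluation)
    lock(MLFLOW_LOCK) do
        experiment = getorcreateexperiment(service(logger), logger.experiment_name)
        # the run is then created and metrics logged under the same lock
    end
end
```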