Closed pebeto closed 2 months ago
Attention: Patch coverage is 90.00000%, with 1 line in your changes missing coverage. Please review.
Project coverage is 87.55%. Comparing base (bb59cae) to head (2b63fa8).
| Files | Patch % | Lines |
|---|---|---|
| src/tuned_models.jl | 90.00% | 1 Missing :warning: |
Looking good, thanks!
Does it all look good on the MLflow service when fitting a `TunedModel(model, logger=MLFlowLogger(...), ...)`?
Looking good locally. I've just uploaded the `TunedModel` test case: JuliaAI/MLJFlow.jl@2153b693ba2dcfb09399ea43614485bbef6d3146
Played around with this some more. Very cool, thanks!
However, there is a problem running in multithreaded mode. It seems only one thread is logging:
```julia
using MLJ
using .Threads
using MLFlowClient

nthreads()
# 5

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="horse")

X, y = make_moons()
model = (@load RandomForestClassifier pkg=DecisionTree)()
r = range(model, :sampling_fraction, lower=0.4, upper=1.0)
tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUThreads(),
    n=100,
)
mach = machine(tmodel, X, y) |> fit!;

nruns = length(report(mach).history)
# 100

service = MLJFlow.service(logger)
experiment = MLFlowClient.getexperiment(service, "horse")
id = experiment.experiment_id
runs = MLFlowClient.searchruns(service, id);
length(runs)
# 20

@assert length(runs) == nruns
# ERROR: AssertionError: length(runs) == nruns
# Stacktrace:
#  [1] top-level scope
#    @ REPL[166]:1
```
The problem is that we are missing `logger` in the cloning of the resampling machine happening here:
I think `CPUProcesses` should be fine, but we should add a test for this at MLJFlow.jl (and for `CPUThreads`).
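For context, the fix amounts to propagating the `logger` field when the per-thread resampler copies are constructed. A sketch of the idea (field names here are illustrative assumptions, not the exact MLJTuning internals):

```julia
# Hypothetical sketch: when TunedModel builds the resampler that each thread
# will clone, every relevant field of the tuned model must be carried over,
# including the logger (previously omitted, so clones logged nothing):
resampler = Resampler(
    model=model,
    resampling=tuned_model.resampling,
    measure=tuned_model.measure,
    logger=tuned_model.logger,  # <- the missing field
)
```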
Thanks for the addition. Sadly, this is still not working for me. I'm getting three experiments on the server, with different ids and the same name, "horse" (I'm only expecting one). One contains 20 evaluations, the other two contain only 1 each, and this complaint is thrown several times:

```
{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'horse' already exists."}
```
Do you have any idea what is happening?
Interestingly, I'm getting the same kind of error for distributed acceleration:
```julia
using Distributed
addprocs(2)
nprocs()
# 3

using MLJ
using MLFlowClient

logger = MLFlowLogger("http://127.0.0.1:5000", experiment_name="rock")

X, y = make_moons()
model = (@iload RandomForestClassifier pkg=DecisionTree)()
r = range(model, :sampling_fraction, lower=0.4, upper=1.0)
tmodel = TunedModel(
    model;
    range=r,
    logger,
    acceleration=CPUProcesses(),
    n=100,
)
mach = machine(tmodel, X, y) |> fit!;
```
Okay, see here for an MWE: https://github.com/JuliaAI/MLFlowClient.jl/issues/40
Revisiting this issue after a few months.
It looks like the multithreading issue is not likely to be addressed soon. Perhaps we can proceed with this PR after strictly ruling out logging for the parallel modes. For example, if `logger` is different from `nothing`, and either `acceleration` or `acceleration_resampling` is different from `CPU1()`, then `clean!` resets the accelerations to `CPU1()` and issues a message saying what it has done and why. The `clean!` code is here.
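The proposed guard might look something like this. This is an illustrative sketch only, using a stand-in struct so the example is self-contained; the field names follow the discussion above, not necessarily the actual `MLJTuning.clean!` implementation:

```julia
using ComputationalResources  # provides CPU1, CPUThreads, CPUProcesses

# Stand-in for a tuned model, just to make the sketch runnable:
mutable struct DummyTunedModel
    logger
    acceleration
    acceleration_resampling
end

function clean!(model::DummyTunedModel)
    message = ""
    model.logger === nothing && return message  # no logger, nothing to guard
    if !(model.acceleration isa CPU1)
        model.acceleration = CPU1()
        message *= "Resetting `acceleration=CPU1()`. "
    end
    if !(model.acceleration_resampling isa CPU1)
        model.acceleration_resampling = CPU1()
        message *= "Resetting `acceleration_resampling=CPU1()`. "
    end
    isempty(message) ||
        (message *= "Logging is not currently thread-safe in parallel modes.")
    return message
end

tmodel = DummyTunedModel(:dummy_logger, CPUThreads(), CPU1())
clean!(tmodel)  # returns a non-empty warning; both accelerations are now CPU1()
```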
@pebeto What do you think?
The solution to this issue is not part of the mlflow plans (see https://github.com/mlflow/mlflow/issues/11122). However, a workaround is presented in https://github.com/JuliaAI/MLJFlow.jl/pull/36 to ensure our process is thread-safe.
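One way to make per-evaluation logging thread-safe is to serialize all calls to the MLflow service through a lock. An illustrative sketch only; the actual workaround in JuliaAI/MLJFlow.jl#36 may be implemented differently:

```julia
# A global lock serializing access to the MLflow service, so that concurrent
# threads cannot race on experiment creation or run logging:
const MLFLOW_LOCK = ReentrantLock()

# `log_evaluation` stands for whatever function actually talks to the server;
# it is passed in here to keep the sketch self-contained:
function log_evaluation_threadsafe(log_evaluation, logger, evaluation)
    lock(MLFLOW_LOCK) do
        log_evaluation(logger, evaluation)
    end
end
```

With this wrapper, only one thread at a time can hit the "does experiment 'horse' exist?" check-then-create path, which is what produced the duplicate experiments and `RESOURCE_ALREADY_EXISTS` errors above.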
Details in alan-turing-institute/MLJ.jl#1029.
`L` for loggers (detailed implementation in MLJBase.jl).