Closed by ablaom 4 months ago
@deyandyankov Could this possibly originate from a limitation of MLFlowClient.jl or mlflow itself?
@ablaom was at least one experiment recorded? It seems to me that the first time evaluate was spawned, an experiment was created (white moon), and then the other executions tried to create an experiment with the same name, which mlflow is denying:

{"error_code": "RESOURCE_ALREADY_EXISTS", "message": "Experiment 'white moon' already exists."}
As far as I can remember, there is no strict limitation in MLFlowClient that explicitly disables concurrency.
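One way to tolerate this race without disabling concurrency would be a create-or-fallback pattern: if creation fails because the name already exists, look the experiment up instead. A minimal sketch with a dummy backend; `AlreadyExists`, `create`, `lookup`, and `db` are all hypothetical stand-ins, not MLFlowClient API:

```julia
# Sketch of a create-or-fallback pattern: if creating an experiment fails
# because the name already exists, look it up instead.
struct AlreadyExists <: Exception end

function create_or_get(create, lookup, name)
    try
        create(name)
    catch e
        e isa AlreadyExists ? lookup(name) : rethrow()
    end
end

# Dummy backend standing in for the mlflow server:
db = Dict{String,Int}()
create(name) = haskey(db, name) ? throw(AlreadyExists()) : (db[name] = length(db) + 1)
lookup(name) = db[name]

create_or_get(create, lookup, "white moon")  # creates the experiment: 1
create_or_get(create, lookup, "white moon")  # falls back to lookup: 1
```

With the real client, the catch branch would match on the RESOURCE_ALREADY_EXISTS error code from the response instead of a custom exception type.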
@deyandyankov Could it be related to the way mlflow handles these pseudo-random names? Maybe they use the Unix timestamp.
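If the names really were derived from a second-resolution Unix timestamp (an assumption; the actual scheme is unknown to us, and `timestamp_name` is purely illustrative), concurrent tasks would collide:

```julia
using Dates

# Hypothetical: a name derived from a second-resolution Unix timestamp.
# This only illustrates why timestamp-based names would collide when
# several tasks start within the same second.
timestamp_name(t::DateTime = now()) = "exp-" * string(round(Int, datetime2unix(t)))

# Two tasks spawned at the same instant get identical names:
t = now()
timestamp_name(t) == timestamp_name(t)  # true, hence RESOURCE_ALREADY_EXISTS
```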
> was at least one experiment recorded?
Yes.
If we send a message to the service to create a new experiment, don't we need to block logging until both the experiment is created and an experiment name is allocated? I don't see any blocking happening at present. https://docs.julialang.org/en/v1/manual/asynchronous-programming/#Communicating-with-Channels
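The blocking described above can be sketched with a Channel consumed by a single task, so only one creation request is ever in flight. The names `create_remote`, `requests`, and `create_experiment_serialized` are hypothetical, not MLFlowClient API; `create_remote` here just records the name:

```julia
# Serialize experiment creation through a Channel: callers block until a
# single consumer task has processed their request.
created = String[]
create_remote(name) = (push!(created, name); name)  # stand-in for the HTTP call

requests = Channel{Tuple{String,Channel{String}}}(32)

# A single consumer task guarantees creations never overlap.
@async for (name, reply) in requests
    put!(reply, create_remote(name))
end

function create_experiment_serialized(name)
    reply = Channel{String}(1)
    put!(requests, (name, reply))
    take!(reply)  # caller blocks here until the experiment exists
end

# Concurrent callers are safely serialized:
results = asyncmap(i -> create_experiment_serialized("exp-$i"), 1:4)
```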
Looks like mlflow does not (or at least at one point did not) support asynchronous actions: https://github.com/mlflow/mlflow/issues/1550#issuecomment-1024492066
In that case, we could generate the random names ourselves, solving the naming problem you identified.
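Client-side unique names (as createexperiment already does for a missing name via UUIDs.uuid4) would avoid depending on the server's naming scheme entirely. A minimal sketch; `unique_experiment_name` is a hypothetical helper:

```julia
using UUIDs

# Generate a unique experiment name on the client, so concurrent callers
# cannot collide on the server side. The prefix is arbitrary.
unique_experiment_name(prefix = "experiment") = string(prefix, "-", uuid4())

names = [unique_experiment_name() for _ in 1:4]
length(unique(names))  # 4: no collisions
```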
I'm not sure I follow. Perhaps I misunderstand the problem. It would be great if you could post a PR to test your theory.
mlflow 2.8.0 was released with experimental async logging for metrics, params, and tags. Maybe we can take this up again.
Thanks for flagging the update!
I'm guessing this won't "just work" and that we need to buy into some new messaging, or something?
This is not yet fixed by mlflow. Even with the response you posted, I'm getting four experiments with the same name (which should be impossible).
I suggest we handle something like a queue in MLFlowClient to avoid this kind of issue, or simply disallow concurrency in our project. Below is the code we need to be aware of.
```julia
function mlfget(mlf, endpoint; kwargs...)
    apiuri = uri(mlf, endpoint, kwargs)
    apiheaders = headers(mlf, Dict("Content-Type" => "application/json"))
    try
        response = HTTP.get(apiuri, apiheaders)
        return JSON.parse(String(response.body))
    catch e
        throw(e)
    end
end

function mlfpost(mlf, endpoint; kwargs...)
    apiuri = uri(mlf, endpoint)
    apiheaders = headers(mlf, Dict("Content-Type" => "application/json"))
    body = JSON.json(kwargs)
    try
        response = HTTP.post(apiuri, apiheaders, body)
        return JSON.parse(String(response.body))
    catch e
        throw(e)
    end
end

function getexperiment(mlf::MLFlow, experiment_id::Integer)
    try
        result = _getexperimentbyid(mlf, experiment_id)
        return MLFlowExperiment(result)
    catch e
        if isa(e, HTTP.ExceptionRequest.StatusError) && e.status == 404
            return missing
        end
        throw(e)
    end
end

function createexperiment(mlf::MLFlow; name=missing, artifact_location=missing, tags=missing)
    endpoint = "experiments/create"
    if ismissing(name)
        name = string(UUIDs.uuid4())
    end
    result = mlfpost(mlf, endpoint; name=name, artifact_location=artifact_location, tags=tags)
    experiment_id = parse(Int, result["experiment_id"])
    getexperiment(mlf, experiment_id)
end

function getorcreateexperiment(mlf::MLFlow, experiment_name::String; artifact_location=missing, tags=missing)
    exp = getexperiment(mlf, experiment_name)
    if ismissing(exp)
        exp = createexperiment(mlf, name=experiment_name, artifact_location=artifact_location, tags=tags)
    end
    exp
end
```
I don't know if we have something like Python's async/await in Julia. Do you know someone who could help us with that? @ablaom
Julia fully supports asynchronous programming: https://docs.julialang.org/en/v1/manual/asynchronous-programming/
What you call a "queue" is called a Channel.
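For reference, the minimal producer/consumer shape from that manual section:

```julia
# A producer task feeds values into a Channel; the consumer drains it.
ch = Channel{Int}(4)

@async begin
    for i in 1:3
        put!(ch, i)   # producer
    end
    close(ch)         # signals the consumer that no more values are coming
end

vals = collect(ch)    # consumer: [1, 2, 3]
```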
I have asked @OkonSamuel to have a look into this. He has expertise in this area (but is also quite busy).
Adding more information: mlflow is not fully accepting async operations. I can't say this with complete certainty, but sometimes it reports three experiments with the same name, which must be impossible according to its own documentation. This may not be something on our side, but it could perhaps be solved using channels (not sure).
> mlflow is not fully accepting async operations

Did you mean "mlflow is now fully accepting async operations"?
The proposal referenced above may resolve this issue.
Closing it in favor of #41.
It seems one cannot log runs to a single experiment asynchronously.