ecmwf / anemoi-training

Increase MLflow HTTP retry #111

Closed gmertes closed 19 hours ago

gmertes commented 3 weeks ago

Increase the number of times MLflow retries HTTP requests, and how long it waits between them, when the server is unavailable. This widens the window during which the server can be offline without runs crashing.

Currently, the default values are:

MLFLOW_HTTP_REQUEST_MAX_RETRIES = 5
MLFLOW_HTTP_REQUEST_TIMEOUT = 120

I need to do some testing to find out which values make sense. There is an exponential backoff mechanism, but also a maximum backoff. Ideally I want to set these so that they allow for several hours of server downtime, to deal with system sessions.

I fetch them from the config object with a default value, so they can be set there, but I don't want to add them to the default config because ideally the user should not set them. They are configurable for power users and debugging, but the regular user doesn't see them.
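A minimal sketch of that pattern, assuming a dict-like MLflow config section. The key name `http_max_retries`, the helper name, and the fallback of 35 (the value reported further down in this thread) are hypothetical and are not taken from the actual diff. Only the environment variable name comes from this PR.

```python
import os
from collections.abc import Mapping, MutableMapping

# Hypothetical fallback; the thread reports 35 retries as roughly 1 hour.
DEFAULT_MAX_RETRIES = 35


def apply_mlflow_http_settings(
    mlflow_cfg: Mapping,
    environ: MutableMapping = os.environ,
) -> None:
    """Export the MLflow retry setting, using a hidden config key if present.

    `http_max_retries` is a hypothetical key name. It is deliberately absent
    from the default config, so regular users never see it.
    """
    max_retries = mlflow_cfg.get("http_max_retries", DEFAULT_MAX_RETRIES)
    # Environment variables are strings, so convert explicitly.
    environ["MLFLOW_HTTP_REQUEST_MAX_RETRIES"] = str(max_retries)


# Usage: call this before the MLflow client is first used.
env = {}
apply_mlflow_http_settings({}, env)                        # falls back to the default
apply_mlflow_http_settings({"http_max_retries": 10}, env)  # power-user override
```

Passing the environment mapping as a parameter keeps the helper easy to test without touching the real process environment.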

Mitigates #110, but is not a real solution.

gmertes commented 1 week ago

After some testing, I found that 35 retries corresponds to a total retry time of 1 hour. I confirmed that training will hang for up to 1 hour, and if the server comes back within that time, training will resume. If it doesn't, training will crash.

I chose 1 hour arbitrarily; that can still be changed.
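As a rough cross-check of the 35-retries-to-1-hour figure, the cumulative wait can be estimated with a capped exponential backoff model. This is a sketch under stated assumptions, not a description of MLflow internals: it assumes a backoff factor of 2, a maximum backoff of 120 seconds, and no wait before the first retry, and it ignores jitter and the time each request itself takes. None of these values are stated in this thread.

```python
def total_backoff_seconds(
    retries: int,
    backoff_factor: float = 2.0,  # assumed value, not from this thread
    backoff_max: float = 120.0,   # assumed cap in seconds, not from this thread
) -> float:
    """Sum the waits of a capped exponential backoff over `retries` retries."""
    total = 0.0
    for attempt in range(1, retries + 1):
        if attempt == 1:
            # Assumption: the first retry happens immediately.
            continue
        # The wait doubles with each consecutive failure until it hits the cap.
        total += min(backoff_factor * 2 ** (attempt - 1), backoff_max)
    return total


print(total_backoff_seconds(5))   # 60.0
print(total_backoff_seconds(35))  # 3604.0, i.e. about one hour
```

Under these assumptions the wait reaches the cap after six retries, so every further retry adds another 120 seconds to the window. The number of retries needed for a longer window therefore grows linearly with the target downtime.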

I did not change the HTTP timeout value, since the default is sufficient. If our MLflow server goes down, requests return an error status code, which triggers a retry. So increasing the timeout would not do anything for us.