Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Thanks again for this report! The fix was shipped in 0.37.0, so I'm going to close the issue for now, but please let us know if you run into anything else.
Describe the bug
After an experiment has been running for a week, it fails with the following error:
The automatic retries also fail:
I have not looked too deeply, but this could be related to the following refactor: https://github.com/determined-ai/determined/pull/8347
and to the session duration set at: https://github.com/determined-ai/determined/blob/3a91552ac34095fac1f493c1bd7c72f849cd0e28/master/internal/user/postgres_users.go#L24
Forking the failed experiment lets it run again without authentication issues, but only for another week.
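To illustrate the suspected failure mode, here is a minimal sketch. It assumes the session token is issued once when the experiment starts and is never refreshed. `Session` and `SESSION_DURATION` are hypothetical names, not Determined's API, and the seven-day value is inferred from the observed failure time rather than confirmed from the code.

```python
from datetime import datetime, timedelta

# Assumption: session lifetime of one week, inferred from when the failures appear.
SESSION_DURATION = timedelta(days=7)


class Session:
    """Hypothetical stand-in for a session token issued when an experiment starts."""

    def __init__(self, issued_at: datetime) -> None:
        self.expires_at = issued_at + SESSION_DURATION

    def is_valid(self, now: datetime) -> bool:
        # The token is accepted only before its fixed expiry time.
        return now < self.expires_at


start = datetime(2024, 1, 1)

# The original experiment keeps using the token issued at start.
original = Session(issued_at=start)
print("day 6:", original.is_valid(start + timedelta(days=6)))
print("day 8:", original.is_valid(start + timedelta(days=8)))

# A retry that reuses the same expired token fails in the same way.
print("retry on day 8:", original.is_valid(start + timedelta(days=8)))

# Forking creates a new experiment, which gets a fresh token and another week.
forked = Session(issued_at=start + timedelta(days=8))
print("fork on day 9:", forked.is_valid(start + timedelta(days=9)))
```

If this model is right, it would explain both why the automatic retries fail and why a fork runs for exactly one more week.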
Reproduction Steps
Expected Behavior
Experiment should continue running without exception.
Screenshot
-
Environment
Determined version 0.33.0
Additional Context
No response