Closed carolinaecalderon closed 2 months ago
Name | Link |
---|---|
Latest commit | 6d109be8f86d26c90c26ff5d8c57d389f0576e54 |
Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/66996394bd2d4f0008490e6f |
Attention: Patch coverage is 83.33333%
with 1 line
in your changes missing coverage. Please review.
Project coverage is 53.42%. Comparing base (
e4a9ae3
) to head (6d109be
). Report is 14 commits behind head on main.
When a cluster restarts, it restarts running trials
always? why? is this because we're not handling missed metrics, and progress reports?
Not sure what you mean -- In the case of a long-running hp experiment, when a cluster goes down/restarts in the middle of that progress, uncompleted trials are re-allocated/re-started. I'm not sure what you mean by missed metrics/progress reports. I see that the trials restart through the master service logs & also on the webUI.
Ticket
RM-368
Description
When a cluster restarts, it restarts running trials. For large experiments with several trials (like a hyperparameter experiment), some of these restored trails end up in STOPPING_COMPLETED indefinitely versus COMPLETED state. Fix this bug.
I was able to reproduce this on AWS by killing the agent service, stopping the master service, and then restarting them both. WIth my fix, the experiments naturally resolved themselves.
Test Plan
See new e2e test. No additional testing needed.
Checklist
docs/release-notes/
See Release Note for details.