determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

fix: resolve indefinitely queued (STOPPING_COMPLETED) trials #9605

Closed: carolinaecalderon closed this 2 months ago

carolinaecalderon commented 3 months ago

Ticket

RM-368

Description

When a cluster restarts, it restarts any trials that were running. For large experiments with many trials (such as a hyperparameter search experiment), some of these restored trials remain stuck in the STOPPING_COMPLETED state indefinitely instead of transitioning to COMPLETED. This PR fixes that bug.

I was able to reproduce this on AWS by killing the agent service, stopping the master service, and then restarting both. With this fix, the stuck experiments resolved themselves naturally.
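To make the failure mode concrete, here is a self-contained Go sketch of the kind of transition guard involved. The state names mirror the ones in the PR title, but the types and helpers are illustrative stand-ins, not the actual patch in `master/internal/trial.go`:

```go
// Illustrative only: the state names mirror the ones in the PR title, but the
// types and helpers below are simplified stand-ins, not the actual code in
// master/internal/trial.go.
package main

import "fmt"

type State string

const (
	StoppingCompleted State = "STOPPING_COMPLETED"
	StoppingCanceled  State = "STOPPING_CANCELED"
	StoppingError     State = "STOPPING_ERROR"
	Completed         State = "COMPLETED"
	Canceled          State = "CANCELED"
	Errored           State = "ERROR"
)

// stoppingToTerminal maps each transient STOPPING_* state to its terminal state.
var stoppingToTerminal = map[State]State{
	StoppingCompleted: Completed,
	StoppingCanceled:  Canceled,
	StoppingError:     Errored,
}

// resolveOnRestore promotes a restored trial that was caught mid-shutdown.
// Without a check like this, a trial restored in STOPPING_COMPLETED keeps
// waiting for an allocation-exit message that was lost in the restart.
func resolveOnRestore(state State, allocationGone bool) State {
	if terminal, ok := stoppingToTerminal[state]; ok && allocationGone {
		return terminal
	}
	return state
}

func main() {
	fmt.Println(resolveOnRestore(StoppingCompleted, true)) // COMPLETED
}
```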

Test Plan

See the new e2e test. No additional testing is needed.
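For readers who want a quick local check, a unit-style test of the illustrative helper from the sketch above could look like this (the real coverage in this PR comes from the new e2e test, not from this):

```go
// Unit-style check of the sketch above (place alongside it as a *_test.go file).
package main

import "testing"

func TestResolveOnRestore(t *testing.T) {
	// A trial restored mid-shutdown with no live allocation should finish.
	if got := resolveOnRestore(StoppingCompleted, true); got != Completed {
		t.Fatalf("expected COMPLETED, got %s", got)
	}
	// A trial whose allocation is still alive keeps its transient state.
	if got := resolveOnRestore(StoppingCompleted, false); got != StoppingCompleted {
		t.Fatalf("expected STOPPING_COMPLETED, got %s", got)
	}
}
```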


netlify[bot] commented 3 months ago

Deploy Preview for determined-ui canceled.

| Name | Link |
|---|---|
| Latest commit | 6d109be8f86d26c90c26ff5d8c57d389f0576e54 |
| Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/66996394bd2d4f0008490e6f |
codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 53.42%. Comparing base (e4a9ae3) to head (6d109be). Report is 14 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #9605      +/-   ##
==========================================
- Coverage   53.44%   53.42%   -0.02%
==========================================
  Files        1254     1254
  Lines      152636   152633       -3
  Branches     3268     3267       -1
==========================================
- Hits        81572    81548      -24
- Misses      70913    70934      +21
  Partials      151      151
```

| [Flag](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | Coverage Δ | |
|---|---|---|
| [backend](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `44.69% <83.33%> (-0.05%)` | :arrow_down: |
| [harness](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `72.84% <ø> (ø)` | |
| [web](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `51.81% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/determined-ai/determined/pull/9605?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | Coverage Δ | |
|---|---|---|
| [master/internal/trial.go](https://app.codecov.io/gh/determined-ai/determined/pull/9605?src=pr&el=tree&filepath=master%2Finternal%2Ftrial.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai#diff-bWFzdGVyL2ludGVybmFsL3RyaWFsLmdv) | `42.10% <83.33%> (+0.24%)` | :arrow_up: |

... and [5 files with indirect coverage changes](https://app.codecov.io/gh/determined-ai/determined/pull/9605/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai)
carolinaecalderon commented 2 months ago

> When a cluster restarts, it restarts running trials
>
> always? why? is this because we're not handling missed metrics, and progress reports?

Not sure what you mean by missed metrics and progress reports. In the case of a long-running hyperparameter search experiment, when the cluster goes down and restarts in the middle of that run, uncompleted trials are re-allocated and restarted; I can see the trials restarting in the master service logs and also in the WebUI.
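To picture that restore behavior, here is a continuation of the earlier sketch (it reuses `State` and `stoppingToTerminal` from above; `Trial`, `reallocate`, and the loop itself are illustrative stand-ins, not the real restore path in the master):

```go
// Continues the earlier sketch; these names are illustrative, not the real
// restore code in the master.
type Trial struct {
	ID    int
	State State
}

func isTerminal(s State) bool {
	return s == Completed || s == Canceled || s == Errored
}

// reallocate stands in for asking the resource manager for a new allocation.
func reallocate(tr *Trial) {
	fmt.Printf("re-allocating trial %d from its latest checkpoint\n", tr.ID)
}

// restoreTrials runs once on master startup.
func restoreTrials(trials []*Trial) {
	for _, tr := range trials {
		switch {
		case isTerminal(tr.State):
			// Already finished; nothing to restore.
		case stoppingToTerminal[tr.State] != "":
			// Caught mid-shutdown: resolve in place instead of waiting for a
			// container-exit message that was lost in the restart.
			tr.State = stoppingToTerminal[tr.State]
		default:
			// Still running: schedule it again.
			reallocate(tr)
		}
	}
}
```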