determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. It works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

fix: resolve indefinitely queued (STOPPING_COMPLETED) trials #9605

Closed: carolinaecalderon closed this 2 months ago

carolinaecalderon commented 3 months ago

Ticket

RM-368

Description

When a cluster restarts, it restarts any trials that were running. For large experiments with many trials (such as a hyperparameter search experiment), some of these restored trials remain stuck in the STOPPING_COMPLETED state indefinitely instead of transitioning to COMPLETED. This PR fixes that bug.

I was able to reproduce this on AWS by killing the agent service, stopping the master service, and then restarting both. With this fix, the stuck experiments resolved themselves naturally.
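To make the failure mode concrete, here is a self-contained Go sketch of the kind of transition guard involved. The state names mirror the ones in the PR title, but the types and helpers are illustrative stand-ins, not the actual patch in `master/internal/trial.go`:

```go
// Illustrative only: the state names mirror the ones in the PR title, but the
// types and helpers below are simplified stand-ins, not the actual code in
// master/internal/trial.go.
package main

import "fmt"

type State string

const (
	StoppingCompleted State = "STOPPING_COMPLETED"
	StoppingCanceled  State = "STOPPING_CANCELED"
	StoppingError     State = "STOPPING_ERROR"
	Completed         State = "COMPLETED"
	Canceled          State = "CANCELED"
	Errored           State = "ERROR"
)

// stoppingToTerminal maps each transient STOPPING_* state to its terminal state.
var stoppingToTerminal = map[State]State{
	StoppingCompleted: Completed,
	StoppingCanceled:  Canceled,
	StoppingError:     Errored,
}

// resolveOnRestore promotes a restored trial that was caught mid-shutdown.
// Without a check like this, a trial restored in STOPPING_COMPLETED keeps
// waiting for an allocation-exit message that was lost in the restart.
func resolveOnRestore(state State, allocationGone bool) State {
	if terminal, ok := stoppingToTerminal[state]; ok && allocationGone {
		return terminal
	}
	return state
}

func main() {
	fmt.Println(resolveOnRestore(StoppingCompleted, true)) // COMPLETED
}
```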

Test Plan

See the new e2e test. No additional testing is needed.
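For readers who want a quick local check, a unit-style test of the illustrative helper from the sketch above could look like this (the real coverage in this PR comes from the new e2e test, not from this):

```go
// Unit-style check of the sketch above (place alongside it as a *_test.go file).
package main

import "testing"

func TestResolveOnRestore(t *testing.T) {
	// A trial restored mid-shutdown with no live allocation should finish.
	if got := resolveOnRestore(StoppingCompleted, true); got != Completed {
		t.Fatalf("expected COMPLETED, got %s", got)
	}
	// A trial whose allocation is still alive keeps its transient state.
	if got := resolveOnRestore(StoppingCompleted, false); got != StoppingCompleted {
		t.Fatalf("expected STOPPING_COMPLETED, got %s", got)
	}
}
```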


netlify[bot] commented 3 months ago

Deploy Preview for determined-ui canceled.

| Name | Link |
|---|---|
| Latest commit | 6d109be8f86d26c90c26ff5d8c57d389f0576e54 |
| Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/66996394bd2d4f0008490e6f |
codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 53.42%. Comparing base (e4a9ae3) to head (6d109be). Report is 14 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #9605      +/-   ##
==========================================
- Coverage   53.44%   53.42%   -0.02%
==========================================
  Files        1254     1254
  Lines      152636   152633       -3
  Branches     3268     3267       -1
==========================================
- Hits        81572    81548      -24
- Misses      70913    70934      +21
  Partials      151      151
```

| [Flag](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | Coverage Δ | |
|---|---|---|
| [backend](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `44.69% <83.33%> (-0.05%)` | :arrow_down: |
| [harness](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `72.84% <ø> (ø)` | |
| [web](https://app.codecov.io/gh/determined-ai/determined/pull/9605/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `51.81% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files](https://app.codecov.io/gh/determined-ai/determined/pull/9605?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | Coverage Δ | |
|---|---|---|
| [master/internal/trial.go](https://app.codecov.io/gh/determined-ai/determined/pull/9605?src=pr&el=tree&filepath=master%2Finternal%2Ftrial.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai#diff-bWFzdGVyL2ludGVybmFsL3RyaWFsLmdv) | `42.10% <83.33%> (+0.24%)` | :arrow_up: |

... and [5 files with indirect coverage changes](https://app.codecov.io/gh/determined-ai/determined/pull/9605/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai)
carolinaecalderon commented 2 months ago

> When a cluster restarts, it restarts running trials
>
> always? why? is this because we're not handling missed metrics, and progress reports?

Not sure what you mean by missed metrics and progress reports. In the case of a long-running hyperparameter search experiment, when the cluster goes down and restarts in the middle of that run, uncompleted trials are re-allocated and restarted; I can see the trials restarting in the master service logs and also in the WebUI.
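To picture that restore behavior, here is a continuation of the earlier sketch (it reuses `State` and `stoppingToTerminal` from above; `Trial`, `reallocate`, and the loop itself are illustrative stand-ins, not the real restore path in the master):

```go
// Continues the earlier sketch; these names are illustrative, not the real
// restore code in the master.
type Trial struct {
	ID    int
	State State
}

func isTerminal(s State) bool {
	return s == Completed || s == Canceled || s == Errored
}

// reallocate stands in for asking the resource manager for a new allocation.
func reallocate(tr *Trial) {
	fmt.Printf("re-allocating trial %d from its latest checkpoint\n", tr.ID)
}

// restoreTrials runs once on master startup.
func restoreTrials(trials []*Trial) {
	for _, tr := range trials {
		switch {
		case isTerminal(tr.State):
			// Already finished; nothing to restore.
		case stoppingToTerminal[tr.State] != "":
			// Caught mid-shutdown: resolve in place instead of waiting for a
			// container-exit message that was lost in the restart.
			tr.State = stoppingToTerminal[tr.State]
		default:
			// Still running: schedule it again.
			reallocate(tr)
		}
	}
}
```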