determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

ci: extend experiment timeout for slurm test #9601

Closed MikhailKardash closed 2 weeks ago

MikhailKardash commented 2 weeks ago

Ticket

None

Description

The Slurm restart test fails on main because the underlying trials time out while waiting for image pulls. This PR does two things:

  1. Bypasses the top-level trial timeout config in the affected test, so the test waits long enough for image pulls to finish.
  2. Makes the affected test suite runnable on feature branches.
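
To illustrate the first change, here is a minimal sketch of a per-test timeout override. It is not the code from this PR: `DEFAULT_TIMEOUT_SECS`, `wait_for_state`, and `make_slow_trial` are hypothetical names, and the state strings and durations are made up for the example. The idea is only that a single test can pass a longer timeout than the suite-wide default when a slow image pull is expected.

```python
import time
from typing import Callable

# Hypothetical suite-wide default; the real value lives in the project's test config.
DEFAULT_TIMEOUT_SECS = 60.0


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout: float = DEFAULT_TIMEOUT_SECS,
    poll_interval: float = 0.01,
) -> float:
    """Poll get_state() until it returns target; return the elapsed seconds.

    Raises TimeoutError if the target state is not reached within timeout.
    """
    start = time.monotonic()
    while True:
        if get_state() == target:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"state {target!r} not reached within {timeout}s")
        time.sleep(poll_interval)


def make_slow_trial(ready_after: float) -> Callable[[], str]:
    """Simulate a trial that reports PULLING (image pull) before RUNNING."""
    start = time.monotonic()

    def get_state() -> str:
        return "RUNNING" if time.monotonic() - start >= ready_after else "PULLING"

    return get_state
```

With a simulated 0.2-second image pull, `wait_for_state(make_slow_trial(0.2), "RUNNING", timeout=0.05)` raises `TimeoutError`, while the same call with `timeout=2.0` returns once the trial reports `RUNNING`. Passing the timeout per call keeps the default unchanged for every other test.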

Test Plan

CI passes, specifically: test-e2e-slurm-restart


netlify[bot] commented 2 weeks ago

Deploy Preview for determined-ui canceled.

| Name | Link |
|---|---|
| Latest commit | cb3eda327ac776d04d0d1aeaca8f2a53543a8721 |
| Latest deploy log | https://app.netlify.com/sites/determined-ui/deploys/66857b801d8dac00086f36a4 |

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 51.64%. Comparing base (2dc59ca) to head (cb3eda3). Report is 4 commits behind head on main.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #9601   +/-   ##
=======================================
  Coverage   51.63%   51.64%
=======================================
  Files        1255     1255
  Lines      152631   152631
  Branches     3092     3091    -1
=======================================
+ Hits        78815    78820    +5
+ Misses      73659    73654    -5
  Partials      157      157
```

| [Flag](https://app.codecov.io/gh/determined-ai/determined/pull/9601/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | Coverage Δ | |
|---|---|---|
| [backend](https://app.codecov.io/gh/determined-ai/determined/pull/9601/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `43.96% <ø> (+<0.01%)` | :arrow_up: |
| [harness](https://app.codecov.io/gh/determined-ai/determined/pull/9601/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `72.76% <ø> (ø)` | |
| [web](https://app.codecov.io/gh/determined-ai/determined/pull/9601/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai) | `48.63% <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai#carryforward-flags-in-the-pull-request-comment) to find out more.

[see 2 files with indirect coverage changes](https://app.codecov.io/gh/determined-ai/determined/pull/9601/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=determined-ai)