determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🐛[bug] Experiments fail after running for a week #9856

Closed hilvi closed 3 weeks ago

hilvi commented 2 months ago

Describe the bug

After an experiment has been running for a week, it fails with the following error:

[2024-08-05 14:54:37] [8b3f458a] [rank=0] Traceback (most recent call last):
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 310, in init
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     yield context
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/harness.py", line 177, in _run_pytorch_trial
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     trainer.fit(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 203, in fit
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     trial_controller.run()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._run()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._train_for_op(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 775, in _train_for_op
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._report_searcher_progress(op, self.searcher_unit)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 521, in _report_searcher_progress
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     op.report_progress(self.state.batches_trained)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/core/_searcher.py", line 87, in report_progress
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._session.post(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 212, in post
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     return self._do_request("POST", path, params, json, data, headers, timeout, False)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     raise errors.UnauthenticatedException()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.

The automatic retries also fail:

[2024-08-05 14:58:32] [d2bcf554] Traceback (most recent call last):
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
<none> [2024-08-05 14:58:32] [d2bcf554]     return _run_code(code, main_globals, None,
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
<none> [2024-08-05 14:58:32] [d2bcf554]     exec(code, run_globals)
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 324, in <module>
<none> [2024-08-05 14:58:32] [d2bcf554]     download_context_directory(sess, info)
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 29, in download_context_directory
<none> [2024-08-05 14:58:32] [d2bcf554]     b64_tgz = bindings.get_GetTaskContextDirectory(sess, taskId=info.task_id).b64Tgz
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/bindings.py", line 19363, in get_GetTaskContextDirectory
<none> [2024-08-05 14:58:32] [d2bcf554]     _resp = session._do_request(
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
<none> [2024-08-05 14:58:32] [d2bcf554]     raise errors.UnauthenticatedException()
<none> [2024-08-05 14:58:32] [d2bcf554] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.

I have not looked into this deeply, but it could be related to the following refactor: https://github.com/determined-ai/determined/pull/8347

It may also be related to the session duration set at: https://github.com/determined-ai/determined/blob/3a91552ac34095fac1f493c1bd7c72f849cd0e28/master/internal/user/postgres_users.go#L24

After forking the failed experiment, the fork runs without authentication issues, but again only for a week.
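
To illustrate the suspected failure mode, here is a minimal toy sketch. It does not use Determined's code: `ToySession`, `UnauthenticatedError`, `first_failing_day`, and the seven-day `SESSION_DURATION` are all hypothetical names and values chosen to match the symptoms above. It assumes the task's token is issued once at start and never refreshed, which is my guess and not something I have confirmed.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: a fixed one-week lifetime, matching the observed failure time.
SESSION_DURATION = timedelta(days=7)


class UnauthenticatedError(Exception):
    """Hypothetical stand-in for the UnauthenticatedException in the logs."""


class ToySession:
    """Toy session whose token is issued once and never refreshed."""

    def __init__(self, issued_at: datetime) -> None:
        self.expires_at = issued_at + SESSION_DURATION

    def post(self, path: str, now: datetime) -> str:
        # Mimic a server that rejects requests carrying an expired token.
        if now >= self.expires_at:
            raise UnauthenticatedError(path)
        return "ok"


def first_failing_day(session: ToySession, start: datetime, days: int) -> Optional[int]:
    """Send one progress report per day; return the first day that fails."""
    for day in range(days + 1):
        try:
            session.post("/progress", now=start + timedelta(days=day))
        except UnauthenticatedError:
            return day
    return None


start = datetime(2024, 1, 1)  # arbitrary example date
original = ToySession(issued_at=start)
print(first_failing_day(original, start, days=10))

# A fork starts a new task, so in this model it gets a fresh token.
fork_start = start + timedelta(days=7)
fork = ToySession(issued_at=fork_start)
print(first_failing_day(fork, fork_start, days=10))
```

In this model both the original run and the fork fail on day 7, and retries that reuse the expired token fail immediately, which matches what I am seeing.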

Reproduction Steps

  1. Create a long-running experiment
  2. Let it run for a week
  3. The experiment fails with UnauthenticatedException

Expected Behavior

The experiment should continue running without raising an exception.

Screenshot

-

Environment

Determined version 0.33.0

Additional Context

No response

ioga commented 2 months ago

thank you for the report. we believe it is a regression, and we'll try to address it as soon as possible.

ioga commented 2 months ago

#9860

stoksc commented 3 weeks ago

thanks again for this report! the fix was shipped in 0.37.0 so i'm going to close the issue for now but please let us know if you run into anything else.