dstackai / dstack

dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Bug]: Task Fails with Error Code INTERRUPTED_BY_NO_CAPACITY #1738

Open movchan74 opened 2 weeks ago

movchan74 commented 2 weeks ago

Steps to reproduce

  1. Clone repo: https://github.com/mobiusml/aana_sdk.git

  2. Install dstack: pip install dstack

  3. Configure dstack:

    dstack config --url https://sky.dstack.ai --project <project_name> --token <token>
  4. (Optional) Modify tests.dstack.yml to disable volume and GPU tests for faster runs:

    type: task
    name: aana-tests
    backends: [runpod]
    image: nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
    env:
     - HF_TOKEN
    commands:
     - apt-get update
     - DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata
     - apt-get install -y libgl1 libglib2.0-0 ffmpeg python3 python3-dev postgresql sudo
     - locale-gen en_US.UTF-8
     - export LANG="en_US.UTF-8" LANGUAGE="en_US:en" LC_ALL="en_US.UTF-8"
     - curl -sSL https://install.python-poetry.org | python3 -
     - export PATH=$PATH:/root/.local/bin
     - poetry install
     - HF_HUB_CACHE="/models_cache" CUDA_VISIBLE_DEVICES="" poetry run pytest -vv
    max_price: 1.0
    resources:
     cpu: 9..
     memory: 32GB..
     gpu: 40GB..
  5. Initialize dstack: dstack init

  6. Start the test run: HF_TOKEN="" dstack apply -f tests.dstack.yml

At the end of the run, you will see the following error message:

Run failed with error code INTERRUPTED_BY_NO_CAPACITY.

Actual behaviour

The tests run and all of them pass, but the task still fails at the end with the error INTERRUPTED_BY_NO_CAPACITY. The error occurs regardless of whether the tests pass or fail, so the GitHub Actions workflow is marked as failed even when the tests succeed.

Sample error log:

aana/tests/units/test_whisper_params.py::test_whisper_params_invalid_temperature[temperature1] PASSED [ 98%]
aana/tests/units/test_whisper_params.py::test_whisper_params_invalid_temperature[invalid_temperature] PASSED [ 99%]
aana/tests/units/test_whisper_params.py::test_whisper_params_invalid_temperature[2] PASSED [100%]

=============================== warnings summary ===============================
../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pyannote/core/notebook.py:134
  /root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pyannote/core/notebook.py:134: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed in 3.11. Use ``matplotlib.colormaps[name]`` or ``matplotlib.colormaps.get_cmap()`` or ``pyplot.get_cmap()`` instead.
    cm = get_cmap("Set1")

../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154
../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154
../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154
../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154
../root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154
  /root/.cache/pypoetry/virtualenvs/aana-M5oJUcis-py3.10/lib/python3.10/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('pyannote')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

aana/tests/db/datastore/test_caption_repo.py: 24 warnings
aana/tests/db/datastore/test_task_repo.py: 14 warnings
aana/tests/db/datastore/test_transcript_repo.py: 6 warnings
aana/tests/db/datastore/test_video_repo.py: 18 warnings
  /workflow/aana/storage/repository/base.py:59: LegacyAPIWarning: The Query.get() method is considered legacy as of the 1.x series of SQLAlchemy and becomes a legacy construct in 2.0. The method is now available as Session.get() (deprecated since: 2.0) (Background on SQLAlchemy 2.0 at: https://sqlalche.me/e/b8d9)
    entity: T | None = self.session.query(self.model_class).get(item_id)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========== 177 passed, 16 skipped, 68 warnings in 420.58s (0:07:00) ===========
Run failed with error code INTERRUPTED_BY_NO_CAPACITY.
Check CLI, server, and run logs for more details.

Expected behaviour

The task should complete successfully when all tests pass, without encountering the INTERRUPTED_BY_NO_CAPACITY error. The GitHub Actions workflow should then reflect the successful test run instead of being marked as failed due to the capacity error.

dstack version

0.18.15

Server logs

I used dstack Sky with my own API keys.

Additional information

2024-09-27T15:18:29.4505488Z ================ 189 passed, 68 warnings in 1395.60s (0:23:15) =================
2024-09-27T15:18:48.3599503Z Run failed with error code INTERRUPTED_BY_NO_CAPACITY.
2024-09-27T15:18:48.3600357Z Check CLI, server, and run logs for more details.
2024-09-27T15:18:48.4605321Z ##[error]Process completed with exit code 1.
2024-09-27T15:18:48.4690032Z Post job cleanup.
jvstme commented 2 weeks ago

Relevant server logs:

{"message": "job(302f9f)aana-tests-0-0: now is RUNNING", "logger": "dstack._internal.server.background.tasks.process_running_jobs", "timestamp": "2024-09-29 19:23:00,931", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: failed because runner is not available or return an error,  age=0:13:46.156854", "logger": "dstack._internal.server.background.tasks.process_running_jobs", "timestamp": "2024-09-29 19:33:16,312", "level": "WARNING"}
{"message": "run(315b6c)aana-tests: run status has changed RUNNING -> TERMINATING", "logger": "dstack._internal.server.background.tasks.process_runs", "timestamp": "2024-09-29 19:33:16,699", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: instance 'aana-tests-0' has been released, new status is TERMINATING", "logger": "dstack._internal.server.services.jobs", "timestamp": "2024-09-29 19:33:17,962", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY", "logger": "dstack._internal.server.services.jobs", "timestamp": "2024-09-29 19:33:17,974", "level": "INFO"}

Looks like the container in RunPod exits before dstack-server collects the status of the job, which leads to a lost runner connection and the error above.
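
For illustration, here is a minimal, self-contained Go sketch of that race (dstack-server is actually Python; the /api/pull path is taken from the container logs below, everything else is made up). Once the container exits, a status poller can only observe a transport error, which the control plane has to interpret as the instance being gone:

    package main

    import (
        "fmt"
        "net/http"
        "net/http/httptest"
    )

    func main() {
        // Stand-in for dstack-runner's HTTP API inside the RunPod container.
        runner := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprint(w, `{"job_states": []}`)
        }))

        // Stand-in for the server-side poller that periodically pulls job status.
        poll := func() error {
            resp, err := http.Get(runner.URL + "/api/pull")
            if err != nil {
                return err // the poller can't tell "job finished" from "instance gone"
            }
            resp.Body.Close()
            return nil
        }

        fmt.Println("poll while the runner is up:", poll()) // <nil>

        // The container exits (here: the runner is shut down) before the final
        // job status has been pulled...
        runner.Close()

        // ...so the next poll fails at the transport level, and the control
        // plane can only conclude the instance is lost (INTERRUPTED_BY_NO_CAPACITY).
        fmt.Println("poll after the runner exits:", poll())
    }

In other words, if the runner goes away before the terminal job status is pulled, "job finished" and "instance lost" become indistinguishable to the poller.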

jvstme commented 2 weeks ago

Final RunPod container logs just before the server-to-runner connection fails:

2024-09-30T20:53:54.569234473Z time=2024-09-30T20:53:54.568856Z level=debug status=200 method=GET endpoint=/api/pull
2024-09-30T20:54:00.506831266Z time=2024-09-30T20:54:00.506447Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:06.187753351Z time=2024-09-30T20:54:06.187541Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:09.924152129Z 2024/09/30 20:54:09 http: response.WriteHeader on hijacked connection from github.com/dstackai/dstack/runner/internal/runner/api.NewServer.JSONResponseHandler.func7 (common.go:121)
2024-09-30T20:54:09.924180966Z time=2024-09-30T20:54:09.924037Z level=debug status=200 method=GET endpoint=/logs_ws
2024-09-30T20:54:09.924208932Z 2024/09/30 20:54:09 http: response.Write on hijacked connection from fmt.Fprintln (print.go:305)
2024-09-30T20:54:11.634708550Z time=2024-09-30T20:54:11.63432Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:15.640658577Z time=2024-09-30T20:54:15.640078Z level=error msg=Exec failed err=[executor.go:249 executor.(*RunExecutor).execJob] exit status 1
2024-09-30T20:54:15.640728133Z time=2024-09-30T20:54:15.640337Z level=info msg=Job state changed new=failed
2024-09-30T20:54:15.640741297Z time=2024-09-30T20:54:15.640386Z level=error msg=Executor failed err=[executor.go:160 executor.(*RunExecutor).Run] [executor.go:249 executor.(*RunExecutor).execJob] exit status 1
2024-09-30T20:54:15.640756527Z time=2024-09-30T20:54:15.640664Z level=info msg=Job finished, shutting down
2024-09-30T20:54:15.646176638Z panic: close of closed channel
2024-09-30T20:54:15.646239915Z 
2024-09-30T20:54:15.646245262Z goroutine 696 [running]:
2024-09-30T20:54:15.646250372Z github.com/dstackai/dstack/runner/internal/runner/api.(*Server).streamJobLogs(0xc00022ad80, 0xc001132dc0)
2024-09-30T20:54:15.646255492Z  /home/runner/work/dstack/dstack/runner/internal/runner/api/ws.go:42 +0x14c
2024-09-30T20:54:15.646261001Z created by github.com/dstackai/dstack/runner/internal/runner/api.(*Server).logsWsGetHandler in goroutine 620
2024-09-30T20:54:15.646265452Z  /home/runner/work/dstack/dstack/runner/internal/runner/api/ws.go:24 +0x87

Looks like the container exits prematurely because of a panic in dstack-runner.

Note: the fact that the job fails with exit status 1 is expected here because some tests from the example repo fail with my HF_TOKEN. What is not expected is the dstack-runner panic.

How to retrieve these logs: set a breakpoint [here](https://github.com/dstackai/dstack/blob/f1f5fdcbd1129cbec52717e27a2220328c53b93c/src/dstack/_internal/core/backends/runpod/compute.py#L153) so that dstack-server does not delete the pod after losing the runner connection, then reproduce the issue and check the container logs in the RunPod console.
jvstme commented 2 weeks ago

So far I have only managed to reproduce this with the full configuration from the aana_sdk repo, which takes about 30-40 minutes. The shorter configuration from step 4 works fine for me, as do some other task configurations I've tried. Maybe the execution time is a factor here.

r4victor commented 2 weeks ago

One hypothesis is that the panic is caused by concurrent executions of logsWsGetHandler(). In that case I assume the channel may be closed twice. But I'm not sure why concurrent executions of logsWsGetHandler() take place.
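
If that's the case, the crash is easy to reproduce in isolation: in Go, closing an already-closed channel panics with exactly this message. Below is a minimal, self-contained sketch (not dstack's runner code; all names are made up) of two goroutines racing to close the same channel, plus the usual sync.Once guard that makes the close idempotent.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        // Two goroutines share one "done" channel and both try to close it
        // when they finish -- the second close panics, as in ws.go:42.
        done := make(chan struct{})

        var wg sync.WaitGroup
        for i := 0; i < 2; i++ { // e.g. two concurrent log-streaming handlers
            wg.Add(1)
            go func() {
                defer wg.Done()
                defer func() {
                    if r := recover(); r != nil {
                        fmt.Println("recovered:", r) // "close of closed channel"
                    }
                }()
                close(done) // whichever goroutine gets here second panics
            }()
        }
        wg.Wait()

        // A simple guard: funnel the close through sync.Once so that repeated
        // closes become no-ops instead of panics.
        guarded := make(chan struct{})
        var once sync.Once
        closeGuarded := func() { once.Do(func() { close(guarded) }) }
        closeGuarded()
        closeGuarded() // safe: the second call does nothing
        fmt.Println("guarded close is idempotent")
    }

Another common fix is to let only a single owner goroutine close the channel and have everything else just read from it.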