movchan74 opened 2 weeks ago
Relevant server logs:
{"message": "job(302f9f)aana-tests-0-0: now is RUNNING", "logger": "dstack._internal.server.background.tasks.process_running_jobs", "timestamp": "2024-09-29 19:23:00,931", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: failed because runner is not available or return an error, age=0:13:46.156854", "logger": "dstack._internal.server.background.tasks.process_running_jobs", "timestamp": "2024-09-29 19:33:16,312", "level": "WARNING"}
{"message": "run(315b6c)aana-tests: run status has changed RUNNING -> TERMINATING", "logger": "dstack._internal.server.background.tasks.process_runs", "timestamp": "2024-09-29 19:33:16,699", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: instance 'aana-tests-0' has been released, new status is TERMINATING", "logger": "dstack._internal.server.services.jobs", "timestamp": "2024-09-29 19:33:17,962", "level": "INFO"}
{"message": "job(302f9f)aana-tests-0-0: job status is FAILED, reason: INTERRUPTED_BY_NO_CAPACITY", "logger": "dstack._internal.server.services.jobs", "timestamp": "2024-09-29 19:33:17,974", "level": "INFO"}
Looks like the container in RunPod exits before dstack-server collects the status of the job, which leads to a lost runner connection and the error above.
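To make the failure mode concrete, here is a minimal sketch of that interaction. This is not dstack's actual server code — the endpoint path is taken from the logs, but the address, port, and polling interval are placeholders I made up:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pollRunner imitates the server-side loop: while the job runs, the server
// keeps polling the runner's /api/pull endpoint. If the container has
// already exited, the request fails, and the server can only conclude the
// runner is gone -- even if the job actually finished moments earlier.
func pollRunner(url string) error {
	for {
		resp, err := http.Get(url)
		if err != nil {
			// The container died before the final status was collected,
			// so this is reported as a lost runner, not a finished job.
			return fmt.Errorf("runner is not available or returned an error: %w", err)
		}
		resp.Body.Close()
		time.Sleep(5 * time.Second) // interval is illustrative
	}
}

func main() {
	// Placeholder runner address; nothing is listening here, so this
	// fails immediately, like polling an exited container.
	if err := pollRunner("http://localhost:10999/api/pull"); err != nil {
		fmt.Println(err)
	}
}
```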
Final RunPod container logs just before the server-to-runner connection fails:
```
2024-09-30T20:53:54.569234473Z time=2024-09-30T20:53:54.568856Z level=debug status=200 method=GET endpoint=/api/pull
2024-09-30T20:54:00.506831266Z time=2024-09-30T20:54:00.506447Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:06.187753351Z time=2024-09-30T20:54:06.187541Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:09.924152129Z 2024/09/30 20:54:09 http: response.WriteHeader on hijacked connection from github.com/dstackai/dstack/runner/internal/runner/api.NewServer.JSONResponseHandler.func7 (common.go:121)
2024-09-30T20:54:09.924180966Z time=2024-09-30T20:54:09.924037Z level=debug status=200 method=GET endpoint=/logs_ws
2024-09-30T20:54:09.924208932Z 2024/09/30 20:54:09 http: response.Write on hijacked connection from fmt.Fprintln (print.go:305)
2024-09-30T20:54:11.634708550Z time=2024-09-30T20:54:11.63432Z level=debug method=GET endpoint=/api/pull status=200
2024-09-30T20:54:15.640658577Z time=2024-09-30T20:54:15.640078Z level=error msg=Exec failed err=[executor.go:249 executor.(*RunExecutor).execJob] exit status 1
2024-09-30T20:54:15.640728133Z time=2024-09-30T20:54:15.640337Z level=info msg=Job state changed new=failed
2024-09-30T20:54:15.640741297Z time=2024-09-30T20:54:15.640386Z level=error msg=Executor failed err=[executor.go:160 executor.(*RunExecutor).Run] [executor.go:249 executor.(*RunExecutor).execJob] exit status 1
2024-09-30T20:54:15.640756527Z time=2024-09-30T20:54:15.640664Z level=info msg=Job finished, shutting down
2024-09-30T20:54:15.646176638Z panic: close of closed channel
2024-09-30T20:54:15.646239915Z
2024-09-30T20:54:15.646245262Z goroutine 696 [running]:
2024-09-30T20:54:15.646250372Z github.com/dstackai/dstack/runner/internal/runner/api.(*Server).streamJobLogs(0xc00022ad80, 0xc001132dc0)
2024-09-30T20:54:15.646255492Z /home/runner/work/dstack/dstack/runner/internal/runner/api/ws.go:42 +0x14c
2024-09-30T20:54:15.646261001Z created by github.com/dstackai/dstack/runner/internal/runner/api.(*Server).logsWsGetHandler in goroutine 620
2024-09-30T20:54:15.646265452Z /home/runner/work/dstack/dstack/runner/internal/runner/api/ws.go:24 +0x87
```
Looks like the container exits prematurely because of a panic in dstack-runner.
Note: the fact that the job fails with exit status 1 is expected here because some tests from the example repo fail with my HF_TOKEN. What is not expected is the dstack-runner panic.
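As an aside, the `response.WriteHeader on hijacked connection` warnings in the log above are what Go's net/http prints when something writes to a connection a handler has already hijacked (e.g. for a websocket upgrade). A minimal sketch that reproduces the warning — it assumes nothing about the runner's real middleware wiring, and the handler names are hypothetical:

```go
package main

import (
	"log"
	"net/http"
)

// withJSONResponse stands in for a generic response-writing middleware,
// like the JSONResponseHandler named in the log line.
func withJSONResponse(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		next(w, r)
		// The handler below already hijacked the connection, so this
		// write is invalid and net/http logs
		// "http: response.WriteHeader on hijacked connection from ...".
		w.WriteHeader(http.StatusOK)
	}
}

// wsHandler hijacks the connection, as a websocket upgrade would.
func wsHandler(w http.ResponseWriter, r *http.Request) {
	hj, ok := w.(http.Hijacker)
	if !ok {
		http.Error(w, "hijacking not supported", http.StatusInternalServerError)
		return
	}
	conn, _, err := hj.Hijack()
	if err != nil {
		log.Println(err)
		return
	}
	conn.Close()
}

func main() {
	http.HandleFunc("/logs_ws", withJSONResponse(wsHandler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Hitting `/logs_ws` on this toy server logs the same warning, which fits the trace: the runner's JSON middleware appears to write a status after `logsWsGetHandler` has hijacked the connection. This is a symptom of the same handler-lifecycle confusion, though not the panic itself.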
So far I've only managed to reproduce this with the full configuration from the aana_sdk repo, which takes about 30-40 minutes. The shorter configuration from step 4 works fine for me, as do some other task configurations I've tried. Maybe the execution time is a factor here.
One hypothesis is that the panic is caused by concurrent executions of `logsWsGetHandler()`. In that case, I assume the channel may be closed twice. But I'm not sure why concurrent executions of `logsWsGetHandler()` take place.
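To illustrate the hypothesis: if two handler goroutines share one done channel and both close it, the second close always panics with exactly `close of closed channel`. A minimal sketch — `logStreamer`, `unsafeClose`, and `safeClose` are hypothetical names, not the runner's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// logStreamer stands in for the runner's websocket log streamer: one
// shared "done" channel, closed when streaming finishes.
type logStreamer struct {
	done      chan struct{}
	closeOnce sync.Once
}

// unsafeClose mirrors the suspected bug: if two handler goroutines both
// reach a bare close(), the second one panics with
// "close of closed channel".
func (s *logStreamer) unsafeClose() {
	close(s.done)
}

// safeClose is one conventional guard: sync.Once makes the close
// idempotent no matter how many goroutines race to it.
func (s *logStreamer) safeClose() {
	s.closeOnce.Do(func() { close(s.done) })
}

func main() {
	s := &logStreamer{done: make(chan struct{})}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two concurrent logsWsGetHandler-like calls
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.safeClose() // swap in s.unsafeClose() to trigger the panic
		}()
	}
	wg.Wait()
	fmt.Println("closed exactly once, no panic")
}
```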
Steps to reproduce
1. Clone the repo: https://github.com/mobiusml/aana_sdk.git
2. Install dstack: `pip install dstack`
3. Configure dstack:
4. (Optional) Modify `tests.dstack.yml` to disable volume and GPU tests for faster runs:
5. Initialize dstack: `dstack init`
6. Start the test run: `HF_TOKEN="" dstack apply -f tests.dstack.yml`
7. At the end of the run, you will see the following error message:
Actual behaviour
The test runs successfully, and all tests pass, but the task fails at the end with the error `INTERRUPTED_BY_NO_CAPACITY`. This occurs regardless of whether the tests pass or fail, which causes the GitHub Actions workflow to be marked as failed even though the tests are successful.

Sample error log:
Expected behaviour
The task should complete successfully if all the tests pass, without encountering the `INTERRUPTED_BY_NO_CAPACITY` error. The GitHub Actions workflow should also reflect the successful test run without marking the task as failed due to the capacity error.

dstack version
0.18.15
Server logs
Additional information