Quansight / open-gpu-server

The Open GPU Server for CI purposes.

Demanding builds can starve the runner process and kill the job #28

Open jaimergp opened 6 months ago

jaimergp commented 6 months ago

This is a known issue, which presents with the following symptoms:

2024-02-28T04:21:10.5308211Z Requested labels: cirun-openstack-cpu-large--8075232475-linux_aarch64_, linux, x64, self-hosted
2024-02-28T04:21:10.5308629Z Job defined at: conda-forge/mongodb-feedstock/.github/workflows/conda-build.yml@refs/pull/80/merge
2024-02-28T04:21:10.5308806Z Waiting for a runner to pick up this job...
2024-02-28T04:22:29.1192514Z Job is about to start running on the runner: cirun-conda-forge--mongodb-feedstock-fd6a0cc (repository)

The self-hosted runner: cirun-conda-forge--mongodb-feedstock-fd6a0cc lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.


The solution is to either reduce resource usage or, if there are no other options, upgrade to a larger runner.
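
As one way to reduce resource usage, a build can cap its parallelism so that some CPU and memory stay free for the runner process. The following is a minimal sketch, not something this repository ships: the function names, the 2 GiB-per-job estimate, and the reserved headroom are all hypothetical values chosen for illustration and would need tuning per feedstock.

```python
import os


def detect_resources():
    """Return (cpu_count, total_mem_gib); memory is None if it cannot be read."""
    cpus = os.cpu_count() or 1
    try:
        # Available on Linux; other platforms may raise, so fall back to None.
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        total_gib = total_bytes / 2**30
    except (ValueError, OSError, AttributeError):
        total_gib = None
    return cpus, total_gib


def safe_job_count(cpu_count, total_mem_gib, mem_per_job_gib=2.0,
                   reserved_cpus=1, reserved_mem_gib=1.0):
    """Pick a parallel job count that leaves headroom for the runner.

    All defaults are illustrative assumptions, not measured values.
    """
    # Keep at least one core free for the runner, but always allow one job.
    by_cpu = max(1, cpu_count - reserved_cpus)
    if total_mem_gib is None:
        # Memory unknown: only the CPU limit can be applied.
        return by_cpu
    usable_gib = max(0.0, total_mem_gib - reserved_mem_gib)
    by_mem = max(1, int(usable_gib // mem_per_job_gib))
    return min(by_cpu, by_mem)


if __name__ == "__main__":
    cpus, mem = detect_resources()
    print(safe_job_count(cpus, mem))
```

The printed number could then be passed to the build tool's own parallelism setting (for example `make -j<N>`). Whether that is enough depends on the build; if the job still starves the runner at a single parallel job, a larger runner is the remaining option.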