dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

Validate NVIDIA container runtime on SSH instances #1947

Open jvstme opened 1 month ago

jvstme commented 1 month ago

Steps to reproduce

  1. Prepare an instance with an NVIDIA GPU, Docker, and CUDA drivers, but without the NVIDIA container runtime (nvidia-container-toolkit).
  2. Create and apply an on-prem fleet configuration with the instance.

Actual behaviour

The fleet is created successfully, but the GPU is missing from its resources.

 FLEET    INSTANCE  BACKEND       RESOURCES                    PRICE  STATUS  CREATED     ERROR 
 on-prem  0         ssh (remote)  24xCPU, 71GB, 36.4GB (disk)  $0.0   idle    57 sec ago

The user may not notice that the GPU is missing, in which case they will only find out that something is wrong when trying to run a job on the instance.

Run failed with error code CONTAINER_EXITED_WITH_ERROR.
Error: could not select device driver "" with capabilities: []
Check CLI, server, and run logs for more details.

Expected behaviour

Fleet provisioning fails, and the user sees an error stating that the NVIDIA container runtime is missing or misconfigured on the instance.
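
A minimal sketch of such a check, not dstack's actual implementation. It assumes the host can be probed with `nvidia-smi` (for GPUs visible to the driver) and `docker info` (for registered runtimes), and that nvidia-container-toolkit is registered with Docker as a runtime named `nvidia`. All function names and the `NvidiaRuntimeError` class are hypothetical.

```python
import json
import shutil
import subprocess
from typing import List


class NvidiaRuntimeError(Exception):
    """Hypothetical error: GPUs exist on the host but Docker cannot use them."""


def parse_gpu_names(nvidia_smi_output: str) -> List[str]:
    # Expects the output of:
    #   nvidia-smi --query-gpu=name --format=csv,noheader
    # which prints one GPU name per line.
    return [line.strip() for line in nvidia_smi_output.splitlines() if line.strip()]


def has_nvidia_runtime(docker_runtimes_json: str) -> bool:
    # Expects the output of:
    #   docker info --format '{{json .Runtimes}}'
    # which is a JSON object keyed by runtime name.
    runtimes = json.loads(docker_runtimes_json) or {}
    return "nvidia" in runtimes


def validate_nvidia_runtime(gpu_names: List[str], docker_runtimes_json: str) -> None:
    # Fail only when the driver sees GPUs but Docker has no "nvidia" runtime.
    # A host without GPUs is a valid CPU-only instance and passes.
    if gpu_names and not has_nvidia_runtime(docker_runtimes_json):
        raise NvidiaRuntimeError(
            f"Found {len(gpu_names)} NVIDIA GPU(s) on the host, but Docker has no "
            "'nvidia' runtime. Is nvidia-container-toolkit installed and configured?"
        )


def probe_and_validate() -> None:
    # Runs the two probes locally; a real check would run them over SSH.
    gpu_names: List[str] = []
    if shutil.which("nvidia-smi"):
        smi = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        if smi.returncode == 0:
            gpu_names = parse_gpu_names(smi.stdout)
    docker = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True, check=True,
    )
    validate_nvidia_runtime(gpu_names, docker.stdout)
```

Keeping the parsing separate from the subprocess calls lets the decision logic be tested without a GPU or Docker. The runtime-name check is a heuristic: a host where the toolkit is installed but registered under a different name would need a different probe.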

dstack version

0.18.22

Server logs

[22:37:27] DEBUG    dstack._internal.server.background.tasks.process_instances:388 Received a host_info {'gpu_vendor': 'none', 'gpu_name': '', 'gpu_memory': 0,
                    'gpu_count': 0, 'addresses': ['10.0.160.57/16', 'fe80::17ff:fe09:d261/64', '172.17.0.1/16'], 'disk_size': 39050715136, 'cpus': 24,         
                    'memory': 75869425664}
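
The logged `host_info` matches the RESOURCES column shown under "Actual behaviour" if the byte counts are converted to GiB. The sketch below reproduces that arithmetic. The formatting rules and the `summarize_host_info` name are assumptions for illustration (not dstack's actual code), and the GPU label format in particular is hypothetical.

```python
GIB = 1024 ** 3


def summarize_host_info(host_info: dict) -> str:
    # Memory is rounded to whole GiB, disk to one decimal place (assumed rules
    # chosen to match the CLI output in this issue).
    parts = [
        f"{host_info['cpus']}xCPU",
        f"{round(host_info['memory'] / GIB)}GB",
    ]
    # Hypothetical GPU label; it is only added when the host reports GPUs.
    if host_info.get("gpu_count", 0) > 0:
        parts.append(f"{host_info['gpu_count']}x{host_info['gpu_name']}")
    parts.append(f"{host_info['disk_size'] / GIB:.1f}GB (disk)")
    return ", ".join(parts)


# Values copied from the server log above.
logged = {
    "gpu_vendor": "none", "gpu_name": "", "gpu_memory": 0, "gpu_count": 0,
    "disk_size": 39050715136, "cpus": 24, "memory": 75869425664,
}
print(summarize_host_info(logged))  # 24xCPU, 71GB, 36.4GB (disk)
```

Because `gpu_count` is 0 in the reported `host_info`, no GPU entry appears in the summary, which is why the problem is easy to miss in the fleet listing.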

Additional information

No response

github-actions[bot] commented 12 hours ago

This issue is stale because it has been open for 30 days with no activity.