docker run --gpus=all will fail as nvidia-smi not available after bootstrap

rivershah commented 1 year ago

During cluster bootstrap the NVIDIA drivers are installed, but they are not loaded, so they cannot be used. It appears that a reboot is required before nvidia-smi becomes available. Because the drivers are not loaded, the command below fails:

docker run --net=host --gpus=all ...

from dask_cloudprovider.gcp import GCPCluster

def test_dask_gcp_cluster_gpu():
    cluster = GCPCluster(
        machine_type="n1-standard-8",
        n_workers=1,
        filesystem_size=100,
        gpu_type="nvidia-tesla-t4",
        ngpus=1,
    )

cloud-init-output.log

Status: Downloaded newer image for daskdev/dask:latest
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
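
The "driver not loaded" message above suggests the nvidia kernel module is absent from the running kernel even though the packages are installed. A minimal sketch of one way to confirm that on the VM, assuming a Linux host where /proc/modules lists loaded modules with the module name as the first field; the helper names here are hypothetical and not part of dask-cloudprovider:

```python
from pathlib import Path


def nvidia_module_loaded(proc_modules_text: str) -> bool:
    """Return True if a kernel module named exactly 'nvidia' is listed.

    Expects text in the /proc/modules format, where each line starts
    with the module name followed by whitespace-separated fields.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "nvidia":
            return True
    return False


def check_host() -> bool:
    # Reads the live module list; only meaningful when run on the Linux VM itself.
    return nvidia_module_loaded(Path("/proc/modules").read_text())
```

Running check_host() on the VM right after bootstrap, and again after a reboot, would show whether the module state is what changes.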

If GPUs are being used, the default image should already have the drivers installed and usable; alternatively, the NVIDIA driver should be loaded after installation without requiring a reboot.
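
A possible workaround sketch for the second option, untested and resting on two assumptions: that commands passed via the extra_bootstrap kwarg run after the driver install and before docker run, and that modprobe can load the freshly installed module without a reboot (it may not if the kernel was upgraded during bootstrap). The function name gpu_cluster_kwargs is hypothetical:

```python
def gpu_cluster_kwargs() -> dict:
    """Build kwargs for GCPCluster that try to load the driver explicitly.

    ASSUMPTION: extra_bootstrap commands execute after the driver install
    step and before the container is started.
    """
    return {
        "machine_type": "n1-standard-8",
        "n_workers": 1,
        "filesystem_size": 100,
        "gpu_type": "nvidia-tesla-t4",
        "ngpus": 1,
        "extra_bootstrap": [
            # Try to load the kernel module instead of rebooting.
            "modprobe nvidia",
            # Fails loudly in cloud-init-output.log if the driver is still unavailable.
            "nvidia-smi",
        ],
    }
```

This would be used as GCPCluster(**gpu_cluster_kwargs()); if nvidia-smi still fails in cloud-init-output.log, the assumptions above do not hold for this image.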

Environment:

Dask version: 2022.9.2
Python version: 3.10
Operating System: ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
Install method (conda, pip, source): pip

siddharthab commented 1 year ago

The mandatory --gpus=all flag is also a problem when using Container-Optimized OS (COS). I can run GPU examples in the Ubuntu-based CUDA Docker images by following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#e2e, but the --gpus=all flag is not needed, and does not work, when using nvidia-container-runtime.

These are the kwargs that would make COS work if the --gpus=all flag were not there:

cos_args = {
    # Use COS image with an LTS milestone.
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#requirements
    "source_image": "projects/cos-cloud/global/images/cos-101-lts",
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#installing_drivers_through_cloud-init
    # This step takes ~2 minutes.
    "extra_bootstrap": [
        "cos-extensions install gpu",
        "mount --bind /var/lib/nvidia /var/lib/nvidia",
        "mount -o remount,exec /var/lib/nvidia",
    ],
    "docker_args": " ".join(
        [
            "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
            "--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin",
            "--device /dev/nvidia0:/dev/nvidia0",
            "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
            "--device /dev/nvidiactl:/dev/nvidiactl",
        ]
    ),
    "bootstrap": False,
}
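
The docker_args above hardcode a single GPU (/dev/nvidia0). A minimal sketch of generating the same string for several GPUs, assuming the additional devices appear as /dev/nvidia1, /dev/nvidia2, and so on; cos_docker_args is a hypothetical helper, not part of dask-cloudprovider:

```python
def cos_docker_args(ngpus: int = 1, nvidia_root: str = "/var/lib/nvidia") -> str:
    """Build the docker arguments that expose COS-installed drivers and GPUs.

    ASSUMPTION: GPU device nodes are numbered /dev/nvidia0 .. /dev/nvidia{ngpus-1}.
    """
    if ngpus < 1:
        raise ValueError("ngpus must be at least 1")
    args = [
        # Driver libraries and binaries installed by `cos-extensions install gpu`.
        f"--volume {nvidia_root}/lib64:/usr/local/nvidia/lib64",
        f"--volume {nvidia_root}/bin:/usr/local/nvidia/bin",
    ]
    # One device node per GPU.
    args += [f"--device /dev/nvidia{i}:/dev/nvidia{i}" for i in range(ngpus)]
    # Shared control devices.
    args += [
        "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
        "--device /dev/nvidiactl:/dev/nvidiactl",
    ]
    return " ".join(args)
```

With ngpus=1 this produces the same string as the docker_args entry in cos_args, so it could be dropped in as "docker_args": cos_docker_args(ngpus).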

jacobtomlinson commented 1 year ago

@siddharthab only Ubuntu is supported currently in dask-cloudprovider.

dask / dask-cloudprovider

docker run --gpus=all will fail as nvidia-smi not available after bootstrap #393