rivershah opened 1 year ago:
The mandatory presence of the --gpus=all flag is also a problem when using Container-Optimized OS (COS). I can run GPU examples in the Ubuntu-based CUDA Docker images following the instructions at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#e2e, but the --gpus=all flag is not needed and does not work when using nvidia-container-runtime.

These are the kwargs that would make COS work if the --gpus=all flag were not there:
```python
cos_args = {
    # Use a COS image with an LTS milestone.
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#requirements
    "source_image": "projects/cos-cloud/global/images/cos-101-lts",
    # Install the GPU drivers through cloud-init; this step takes ~2 minutes.
    # https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#installing_drivers_through_cloud-init
    "extra_bootstrap": [
        "cos-extensions install gpu",
        "mount --bind /var/lib/nvidia /var/lib/nvidia",
        "mount -o remount,exec /var/lib/nvidia",
    ],
    # Expose the driver libraries, binaries, and device nodes to the
    # container by hand instead of relying on --gpus=all.
    "docker_args": " ".join(
        [
            "--volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64",
            "--volume /var/lib/nvidia/bin:/usr/local/nvidia/bin",
            "--device /dev/nvidia0:/dev/nvidia0",
            "--device /dev/nvidia-uvm:/dev/nvidia-uvm",
            "--device /dev/nvidiactl:/dev/nvidiactl",
        ]
    ),
    # Skip the default bootstrap, which does not apply to COS.
    "bootstrap": False,
}
```
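For context, here is a minimal sketch of how these kwargs could be passed through, assuming the --gpus=all flag were made optional. Everything below other than cos_args (the project ID, zone, machine type, GPU count/type, and Docker image) is a placeholder of my own, not a value from this issue:

```python
# Sketch only: unpack the COS kwargs above into GCPCluster.
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    projectid="my-project",        # placeholder GCP project ID
    zone="us-east1-c",             # placeholder zone
    machine_type="n1-standard-4",  # placeholder machine type
    ngpus=1,                       # placeholder GPU count
    gpu_type="nvidia-tesla-t4",    # placeholder GPU type
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # placeholder
    **cos_args,  # source_image, extra_bootstrap, docker_args, bootstrap
)
```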
@siddharthab only Ubuntu is currently supported in dask-cloudprovider.
During cluster bootstrap, the drivers are installed but they are not available because they are never loaded. It appears that a reboot is required before nvidia-smi becomes available. As the NVIDIA drivers are not loaded, the command below will fail (output attached as cloud-init-output.log).
If GPUs are being used, the default image should either ship with the drivers already installed and usable, or the NVIDIA driver should be loaded immediately after installation so that no reboot is required.
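One way to avoid the reboot might be to load the kernel modules explicitly right after the install step, sketched below as untested extra_bootstrap commands (the module names are my assumption based on standard NVIDIA driver packaging, not something verified in this issue):

```python
# Untested sketch: load the NVIDIA kernel modules explicitly after the
# driver install so nvidia-smi works without a reboot.
extra_bootstrap = [
    "modprobe nvidia",      # core driver module
    "modprobe nvidia_uvm",  # unified memory module, required by CUDA
    "nvidia-smi",           # sanity check; fails if the driver is not loaded
]
```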
Environment: