Hi @carbocation!
We just recently updated the provider docs (https://github.com/DataBiosphere/dsub/blob/master/docs/providers/README.md) around this:
- --accelerator-type:
The Compute Engine accelerator type. See https://cloud.google.com/compute/docs/gpus/ for supported GPU types.
Only NVIDIA GPU accelerators are currently supported. If an NVIDIA GPU is attached, the required runtime libraries will be made available to all containers under /usr/local/nvidia.
Each version of Container-Optimized OS image (used by the Pipelines API) has a default supported NVIDIA GPU driver version. See https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install
Note that attaching a GPU increases the worker VM startup time by a few minutes. (default: None)
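For reference, a GPU job request along these lines looks roughly like the following (a sketch only; the project, bucket, and zone values are placeholders, not taken from this issue):

# Hypothetical invocation: request one NVIDIA GPU for the worker VM.
# The driver libraries then get mounted into the container at /usr/local/nvidia.
dsub \
  --provider google-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --image gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 1 \
  --command 'python -c "import torch; print(torch.version.cuda)"'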
Looking at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install, the driver version should be 450.51.06, and per "CUDA Toolkit and Compatible Driver Versions", that means only the very latest CUDA release (11.1.0) would require a newer driver. I don't know what version 10020 maps to.
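(If that 10020 comes from the CUDA runtime's integer version encoding, my guess is it decodes as major * 1000 + minor * 10, i.e. 10.2, though I haven't verified where that particular number is printed from:)

# Assuming the usual CUDA integer version encoding (major * 1000 + minor * 10):
v=10020
echo "$(( v / 1000 )).$(( (v % 1000) / 10 ))"   # -> 10.2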
On the surface, it seems like the CUDA version needs to be downgraded. Looking at the deep learning containers (https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container), perhaps an earlier pytorch-gpu version is needed instead of 1.6?
Thanks, @mbookman! That gave me a bunch of leads to track down.
As noted in those docs, "To run GPUs on Container-Optimized OS VM instances, the Container-Optimized OS release milestone must be a LTS milestone and the milestone number must be 85 or higher" (my emphasis). The default NVIDIA driver for COS 85 LTS is 450.51.06, just as you noted, which is an NVIDIA CUDA 11.0-compatible driver. This specific driver was added in cos-85-13310-1041-9. And by looking at the disk while a dsub-created machine is running, I can confirm that dsub is launching a disk that describes itself as "Derivative of projects/cos-cloud/global/images/cos-stable-85-13310-1041-28.", which is the latest COS 85. Based on the COS 85 defaults, my initial expectation was that dsub would also expose those same NVIDIA CUDA 11.0-compatible drivers.
However, since the NVIDIA drivers are mounted by dsub at /usr/local/nvidia, I was able to run /usr/local/nvidia/bin/nvidia-smi, which reported driver version 440.64.00, which is compatible with the older CUDA 10.2 rather than 11.0:
/usr/local/nvidia/bin/nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    66W / 149W |      0MiB / 11441MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
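(For anyone reproducing this, the same value can be pulled in a scriptable form; these are standard nvidia-smi flags, nothing dsub-specific:)

# Print just the driver version rather than the full table.
/usr/local/nvidia/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader
# -> 440.64.00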
I then confirmed that the software on the Docker instance expected CUDA 11.0-compatible drivers. E.g., /usr/local/nvidia/bin/nvcc returns:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
Likewise, Python's torch.version.cuda reports 11.0. For completeness, $LD_LIBRARY_PATH is set to /usr/local/nvidia/lib:/usr/local/nvidia/lib64.
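(Rolled together, the container-side checks above amount to something like the following, using the paths noted in this thread:)

# Container-side view of the expected CUDA version.
/usr/local/nvidia/bin/nvcc --version
python -c "import torch; print(torch.version.cuda)"
echo "$LD_LIBRARY_PATH"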
My conclusion so far:
- The gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest Docker instance expects an NVIDIA driver that is compatible with CUDA 11.0.
- dsub is exposing an NVIDIA driver that is only compatible with CUDA up to 10.2.
- Most likely my understanding is wrong about "what" is choosing the NVIDIA driver to expose, because the following doesn't really make sense to me, but my perception is that when dsub launches COS 85, it is exposing a non-default NVIDIA driver (440.64.00) instead of the expected default driver for COS 85 (450.51.06).
So to your point, I will troubleshoot this by trying out older Docker images, but given that the exposed driver seems to be nonstandard relative to the documented default for COS 85, I'm still a bit confused as to why we're getting this older driver.
Thanks, @carbocation.
Doing a bit of further digging, it appears from the serial console of a booting Pipelines API VM that the API controller is explicitly installing version 440.64.00:
[ 48.011769] exec_start.sh[594]: + NVIDIA_DRIVER_VERSION=440.64.00
[ 48.011797] exec_start.sh[594]: + NVIDIA_DRIVER_MD5SUM=
[ 48.011826] exec_start.sh[594]: + NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
[ 48.011858] exec_start.sh[594]: + NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
[ 48.011887] exec_start.sh[594]: + ROOT_MOUNT_DIR=/root
[ 48.011915] exec_start.sh[594]: + CACHE_FILE=/usr/local/nvidia/.cache
[ 48.013025] exec_start.sh[594]: + LOCK_FILE=/root/tmp/cos_gpu_installer_lock
[ 48.013059] exec_start.sh[594]: + LOCK_FILE_FD=20
[ 48.013088] exec_start.sh[594]: + set +x
[ 48.018760] exec_start.sh[594]: [INFO 2020-12-02 22:18:26 UTC] PRELOAD: false
[ 48.020060] exec_start.sh[594]: [INFO 2020-12-02 22:18:26 UTC] Running on COS build id 13310.1041.24
[ 48.020772] exec_start.sh[594]: [INFO 2020-12-02 22:18:26 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/13310.1041.24
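For reference, this came from the worker VM's serial console while it was running; something like the following pulls it (the instance name and zone here are placeholders for whatever the Pipelines API assigned):

# Fetch the boot log of the Pipelines API worker VM and look for the driver pin.
gcloud compute instances get-serial-port-output WORKER_INSTANCE_NAME \
  --zone us-central1-a | grep NVIDIA_DRIVER_VERSION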
That seems inconsistent with what had been posted to GCP Life Sciences Discuss here:
https://groups.google.com/g/gcp-life-sciences-discuss/c/DIhQdGhVZT4
but perhaps the specific planned rollout just hasn't happened yet:
we are planning to deprecate the VirtualMachine.nvidia_driver_version field of the Cloud Life Sciences RunPipeline method shortly after the release of COS 85 to the stable track.
I'll post a question back on that thread to get an update.
Thanks, @mbookman. I saw the one reply so far, but I don't think that's actionable from a "dsub-user" perspective. At least from what I can tell, that approach seems to require manually downloading the drivers onto the underlying OS itself, which is a layer of abstraction beneath the Docker layer that dsub users get to specify.
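From what I can tell, that path means running the installer on the COS host itself (i.e., outside of anything dsub exposes), roughly:

# Run on the COS host VM, not inside the job container; per the COS GPU docs
# linked above, this installs the default driver for the running COS milestone.
sudo cos-extensions install gpu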
Just looping back to report that, as a stop-gap, the gcr.io/deeplearning-platform-release/pytorch-gpu.1-4:latest image is compatible with the NVIDIA drivers being loaded on the current COS 85 that dsub is launching.
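For anyone else hitting this, a quick way to sanity-check a candidate image against the drivers dsub mounts in is:

# Fails (or reports False) if the image's CUDA build and the mounted driver are
# mismatched; succeeds for an image whose CUDA version the driver supports.
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available()); print(torch.zeros(1).cuda())"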
Great!
And Tim noted in the gcp-life-sciences-discuss thread:
No, this isn't a misunderstanding on your part. We encountered some stability issues when using cos-extensions and so have been waiting to migrate until they are resolved. In the meantime, the version used by the Life Sciences API sometimes lags behind the default COS version. Our release next week will bring it back in sync (450.51.06).
Nice - there's the answer!
I guess the one follow-up question I have from the other reply is whether the NVIDIA driver version is going to remain something that we (or at least dsub) could explicitly set in the COS startup script. If so, it might be nice to expose that option. I'm not confident that I interpreted the reply correctly, though.
Updated the gcp-life-sciences-discuss thread asking for some clarifications.
My impression from that thread is that it might be helpful to be able to specify the COS image with dsub (determined by bootImage in the Pipelines API). Since we can't currently set this (at least, as far as I have noticed), it seems that we default to projects/cos-cloud/global/images/family/cos-stable.
When putting together a long-term reproducible pipeline, setting that would be appealing. (Even if old versions expire, at least it makes it easier to track down the provenance of results.)
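In the meantime, one can at least record which concrete image the cos-stable family currently resolves to, e.g. with a plain gcloud lookup (independent of dsub):

# Resolve the cos-stable family to the specific image currently behind it.
gcloud compute images describe-from-family cos-stable --project cos-cloud \
  --format 'value(name)'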
Did you get a similar impression from the thread? Or am I missing a configuration option that is already built into dsub?
The thread indicates that we really can't set the bootImage to anything other than cos-stable or cos-dev, and that the expectation is that the COS images will remain backward compatible with respect to their NVIDIA drivers for some time.
I don't think there's anything more to do here unless time proves the backward compatibility of drivers is not reliable.
Sounds good. I'm going to close this issue. Thanks for pursuing this on the mailing list and for your follow-up here.
Not actually 100% sure that this is a dsub issue, but I'm trying to run a Docker image which is based on gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest. When I execute python, I get the following error in dsub:

I believe this is mapped via dsub and so this isn't something I can fix on my end. Is that accurate?