DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

The NVIDIA driver on your system is too old (found version 10020). #215

Closed: carbocation closed this issue 3 years ago

carbocation commented 3 years ago

I'm not actually 100% sure this is a dsub issue, but I'm trying to run a Docker image based on gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest. When I execute Python, I get the following error from dsub:

Failure message: Stopped running "user-command": exit status 1: /site-packages/torch/nn/modules/module.py", line 225, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 247, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 463, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 150, in _lazy_init
    _check_driver()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 63, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError: 
The NVIDIA driver on your system is too old (found version 10020).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.

I believe the NVIDIA driver is mapped into the container by dsub, so this isn't something I can fix on my end. Is that accurate?

mbookman commented 3 years ago

Hi @carbocation!

We just recently updated the provider docs (https://github.com/DataBiosphere/dsub/blob/master/docs/providers/README.md) around this:

  • --accelerator-type:

The Compute Engine accelerator type. See https://cloud.google.com/compute/docs/gpus/ for supported GPU types.

Only NVIDIA GPU accelerators are currently supported. If an NVIDIA GPU is attached, the required runtime libraries will be made available to all containers under /usr/local/nvidia.

Each version of the Container-Optimized OS image (used by the Pipelines API) has a default supported NVIDIA GPU driver version. See https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install

Note that attaching a GPU increases the worker VM startup time by a few minutes. (default: None)
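
For reference, a GPU submission that uses this flag looks roughly like the example below. This is just a sketch of the flag usage, not output from this issue; the project, bucket, zone, and accelerator values are placeholders to adjust for your setup.

# Sketch only: my-project, my-bucket, and the zone are placeholders.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --image gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 1 \
  --command 'python -c "import torch; print(torch.cuda.is_available())"' \
  --wait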

Looking at https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install, the default driver version should be 450.51.06, and per the "CUDA Toolkit and Compatible Driver Versions" table, only the very latest CUDA toolkit (11.1.0) would require a newer driver.

I don't know what version 10020 maps to.

On the surface, it seems like the CUDA toolkit in the container needs to be downgraded to match the driver. Looking at the deep learning containers:

https://cloud.google.com/ai-platform/deep-learning-containers/docs/choosing-container

perhaps an earlier pytorch-gpu version is needed instead of 1.6?
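
If it helps, the available pytorch-gpu images and their tags can be listed from the deeplearning-platform-release repository with something along these lines (standard gcloud commands, nothing dsub-specific):

# List the pytorch-gpu images published under deeplearning-platform-release.
gcloud container images list --repository=gcr.io/deeplearning-platform-release | grep pytorch-gpu

# Show the tags for one of the older images.
gcloud container images list-tags gcr.io/deeplearning-platform-release/pytorch-gpu.1-4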

carbocation commented 3 years ago

Thanks, @mbookman ! That gave me a bunch of leads to track down.

As noted in those docs, "To run GPUs on Container-Optimized OS VM instances, the Container-Optimized OS release milestone must be a LTS milestone and the milestone number must be 85 or higher" (my emphasis). The default NVIDIA driver for COS 85 LTS is 450.51.06, just as you noted, which is a CUDA 11.0-compatible driver. This specific driver was added in cos-85-13310-1041-9.

By looking at the disk while a dsub-created machine is running, I can confirm that dsub is launching a disk that describes itself as "Derivative of projects/cos-cloud/global/images/cos-stable-85-13310-1041-28.", which is the latest COS 85. Based on the COS 85 defaults, my initial expectation was that dsub would also expose those same CUDA 11.0-compatible drivers.
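
For reference, this is roughly how I checked which image the boot disk came from. The names below are placeholders and will differ for each dsub job; the filter assumes the worker VM name contains "pipelines-worker", so adjust it if yours differs.

# Find the Pipelines API worker VM while the job is running, then inspect its boot disk.
gcloud compute instances list --filter="name~pipelines-worker"
gcloud compute disks describe WORKER_DISK_NAME --zone us-central1-a --format="value(sourceImage)"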

However, since dsub mounts the NVIDIA drivers at /usr/local/nvidia, I was able to run /usr/local/nvidia/bin/nvidia-smi, which reported a driver version of 440.64.00, compatible with the older CUDA 10.2 rather than 11.0:

/usr/local/nvidia/bin/nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    66W / 149W |      0MiB / 11441MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I then confirmed that the software on the Docker instance expected CUDA 11.0-compatible drivers. E.g., /usr/local/nvidia/bin/nvcc returns:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Likewise, Python's torch.version.cuda reports 11.0.

For completeness, $LD_LIBRARY_PATH reports /usr/local/nvidia/lib:/usr/local/nvidia/lib64.
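
Putting the above together, the checks boil down to a few commands run inside the dsub container (the driver mount path is as documented by dsub; the values noted in the comments are the ones reported above):

# Driver that dsub mounts into the container (reports 440.64.00 / CUDA 10.2 here).
/usr/local/nvidia/bin/nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit shipped in the image (reports release 11.0).
/usr/local/nvidia/bin/nvcc --version

# CUDA toolkit PyTorch was built against (reports 11.0).
python -c "import torch; print(torch.version.cuda)"

# Library search path (reports /usr/local/nvidia/lib:/usr/local/nvidia/lib64).
echo "$LD_LIBRARY_PATH"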

My conclusion so far:

  1. The gcr.io/deeplearning-platform-release/pytorch-gpu.1-6:latest Docker instance expects an NVIDIA driver that is compatible with CUDA 11.0
  2. dsub is exposing an NVIDIA driver that is only compatible with CUDA up to 10.2

Most likely my understanding of "what" is choosing the NVIDIA driver to expose is wrong, because the following doesn't really make sense to me: my perception is that when dsub launches COS 85, it exposes a non-default NVIDIA driver (440.64.00) instead of the expected default driver for COS 85 (450.51.06).

So, to your point, I will troubleshoot this by trying out older Docker images, but given that the exposed driver seems to be nonstandard relative to the COS 85 documentation, I'm still a bit confused as to why we're getting this older driver.

mbookman commented 3 years ago

Thanks @carbocation .

Doing a bit of further digging: from the serial console when booting a Pipelines API VM, it appears that the API controller is explicitly installing driver version 440.64.00:

[   48.011769] exec_start.sh[594]: + NVIDIA_DRIVER_VERSION=440.64.00
[   48.011797] exec_start.sh[594]: + NVIDIA_DRIVER_MD5SUM=
[   48.011826] exec_start.sh[594]: + NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
[   48.011858] exec_start.sh[594]: + NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
[   48.011887] exec_start.sh[594]: + ROOT_MOUNT_DIR=/root
[   48.011915] exec_start.sh[594]: + CACHE_FILE=/usr/local/nvidia/.cache
[   48.013025] exec_start.sh[594]: + LOCK_FILE=/root/tmp/cos_gpu_installer_lock
[   48.013059] exec_start.sh[594]: + LOCK_FILE_FD=20
[   48.013088] exec_start.sh[594]: + set +x
[   48.018760] exec_start.sh[594]: [INFO    2020-12-02 22:18:26 UTC] PRELOAD: false
[   48.020060] exec_start.sh[594]: [INFO    2020-12-02 22:18:26 UTC] Running on COS build id 13310.1041.24
[   48.020772] exec_start.sh[594]: [INFO    2020-12-02 22:18:26 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/13310.1041.24
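
For anyone who wants to reproduce this, the serial console output can be pulled while the worker VM is still running; the VM name and zone below are placeholders:

# Grab the boot log from the Pipelines API worker and look for the driver version it installs.
gcloud compute instances get-serial-port-output WORKER_VM_NAME --zone us-central1-a | grep NVIDIA_DRIVER_VERSION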

That seems inconsistent with what had been posted to GCP Life Sciences Discuss here:

https://groups.google.com/g/gcp-life-sciences-discuss/c/DIhQdGhVZT4

but perhaps the specific planned rollout just hasn't happened yet:

we are planning to deprecate the VirtualMachine.nvidia_driver_version field of the Cloud Life Sciences RunPipeline method shortly after the release of COS 85 to the stable track.

I'll post a question back on that thread to get an update.

carbocation commented 3 years ago

Thanks, @mbookman . I saw the one reply so far, but I don't think that's actionable from a "dsub-user" perspective. At least from what I can tell, that approach seems to require manually downloading the drivers onto the underlying OS itself, which is a layer of abstraction beneath the Docker layer that dsub users get to specify.

carbocation commented 3 years ago

Just looping back to report that, as a stop-gap, the gcr.io/deeplearning-platform-release/pytorch-gpu.1-4:latest image is compatible with the NVIDIA drivers being loaded on the current COS 85 that dsub is launching.
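
Concretely, the fix was just swapping the --image value; the rest of the submission below is a sketch with placeholder project, bucket, zone, and test command.

# Same job as before, just pinned to the older pytorch-gpu image.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 1 \
  --image gcr.io/deeplearning-platform-release/pytorch-gpu.1-4:latest \
  --command 'python -c "import torch; print(torch.cuda.get_device_name(0))"'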

mbookman commented 3 years ago

Great!

And Tim noted in the gcp-life-sciences-discuss thread:

No, this isn't a misunderstanding on your part. We encountered some stability issues when using cos-extensions and so have been waiting to migrate until they are resolved. In the meantime, the version used by the Life Sciences API sometimes lags behind the default COS version. Our release next week will bring it back in sync (450.51.06).

carbocation commented 3 years ago

Nice - there's the answer!

I guess the one follow-up question I have from the other reply is whether the NVIDIA driver version will remain something that we (or at least dsub) can explicitly specify in the COS startup script. If so, it might be nice for dsub to expose that option. I'm not confident that I interpreted it correctly, though.

mbookman commented 3 years ago

Updated the gcp-life-sciences-discuss thread asking for some clarifications.

carbocation commented 3 years ago

My impression from that thread is that it might be helpful to be able to specify the COS image with dsub (determined by bootImage in the Pipelines API). Since we can't currently set this (at least, as far as I have noticed), it seems that we default to projects/cos-cloud/global/images/family/cos-stable.

When putting together a long-term reproducible pipeline, setting that would be appealing. (Even if old versions expire, at least it makes it easier to track down the provenance of results.)

Did you get a similar impression from the thread? Or am I missing a configuration option that is already built into dsub?

mbookman commented 3 years ago

The thread indicates that we really can't set the bootImage to anything other than cos-stable or cos-dev, and that the expectation is that the COS images will remain backward compatible with respect to their NVIDIA drivers for some time.

I don't think there's anything more to do here unless time proves the backward compatibility of drivers is not reliable.

carbocation commented 3 years ago

Sounds good. I'm going to close this issue. Thanks for pursuing this on the mailing list and for your follow-up here.