DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Allow user to specify NVIDIA drivers #224

Closed: carbocation closed this issue 7 months ago

carbocation commented 3 years ago

In Issue #215 we had a fairly in-depth discussion of how the COS version is currently used to pick NVIDIA drivers. At the time, that issue was mostly about a driver rollback, so it has since been closed. However, I think out-of-date NVIDIA drivers are becoming a bigger problem, and I wanted to loop back to mention it.

The default COS NVIDIA driver seems to lag significantly behind the latest NVIDIA releases, such that I often cannot use dsub for tasks that expect up-to-date drivers. For example, the latest COS milestone is 89, which still ships NVIDIA driver v450.119.04; that driver is only compatible with CUDA 11.0, not 11.1, 11.2, or 11.3. So, 8 months after we first chatted about 11.1 in Issue #215, we're still stuck on 11.0 compatibility.

Since the COS NVIDIA driver version lags so significantly, is there any way for the user to override the NVIDIA drivers? E.g., can we modify the startup script to fetch a newer version of the drivers on the COS host (i.e., below the level of the Docker image)?

The specific use case for today's issue is trying to run AlphaFold (which isn't working with CUDA 11.0, but is working with CUDA 11.1).
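
For reference, here is a minimal sketch of how a GPU task is submitted today; the project, bucket, zone, and image tag below are placeholders, and the point is that nothing here selects the driver version, which is whatever the COS host ships (running nvidia-smi inside the task shows what you actually got):

# Placeholder values: my-project, gs://my-bucket/logs, and the CUDA image tag.
# --accelerator-type / --accelerator-count request the GPU; there is no flag here
# to choose the NVIDIA driver version, which is the gap this issue describes.
dsub \
  --provider google-cls-v2 \
  --project my-project \
  --zones "us-central1-*" \
  --logging gs://my-bucket/logs \
  --image nvidia/cuda:11.0-base \
  --accelerator-type nvidia-tesla-t4 \
  --accelerator-count 1 \
  --command 'nvidia-smi' \
  --wait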

wnojopra commented 3 years ago

Hi @carbocation,

I raised this issue with the same folks in this thread on gcp-life-sciences-discuss.

Copy-pasting Paul's response here:

1) The COS distributions seem to be built with the security measure of only allowing cryptographically signed drivers:

https://cloud.google.com/container-optimized-os/docs/concepts/features-and-benefits#limitations

https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#security

Basically, this means you would not be able to load your own updated CUDA drivers. That capability would need to be requested through one of the contact methods at the following link:

https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us
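
As a rough illustration of why this matters in practice: on a COS GPU VM, the cos-extensions tool only offers the driver versions that were pre-built and signed for that milestone, so anything newer is simply not installable. A sketch, assuming you are SSH'd into a COS VM and using the cos-extensions syntax from the COS GPU docs (the version shown is illustrative):

# List the GPU driver versions that are pre-built and signed for this COS milestone.
sudo cos-extensions list
# Install one of the listed versions; versions outside this signed set are rejected,
# which is why a newer driver cannot simply be dropped onto the host.
sudo cos-extensions install gpu -- -version=450.119.04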

2) If you do want to install your own NVIDIA drivers on Google VM instances, you can do so on a custom VM that does not use a COS image. The following link details the steps, including links for installing CUDA 11.1:

https://cloud.google.com/compute/docs/gpus/install-drivers-gpu
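
For completeness, a minimal sketch of that non-COS route, assuming a Debian/Ubuntu GCE VM and NVIDIA's apt repository (package names and repository setup vary by OS version; see the linked guide for the exact steps):

# Kernel headers are required to build the driver modules.
sudo apt-get update
sudo apt-get install -y build-essential linux-headers-$(uname -r)
# Add NVIDIA's CUDA apt repository (keyring/repo setup omitted; follow the linked guide).
# Then install a specific driver + toolkit combination, e.g. CUDA 11.1.
sudo apt-get install -y cuda-11-1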

3) AlphaFold appears to work with 11.0 based on the following two links, though I have not tested it and I am sure you have more current requirements:

https://github.com/deepmind/alphafold/blob/main/docker/Dockerfile#L15

https://github.com/deepmind/alphafold/blob/main/README.md#first-time-setup

Let me know if there is anything else I can help out with.

As I understand it, the Life Sciences API is limited by what COS supports, and it doesn't seem likely that this will change. We may have better luck talking to the team working on COS to get support.

I am also a bit surprised that the AlphaFold links indicate it works with 11.0. What is the specific issue you see when trying to run with 11.0?

carbocation commented 3 years ago

Thanks for the follow-up!

Re: the COS image issue: that's useful to know. It sounds like this isn't going to be configurable on the user's side, so if a desired CUDA version isn't available, it will take some outreach to get an appropriate COS image.

Re: the specific AlphaFold issue: the initially released version of AlphaFold did not work on the GPU for me or for others, and the base image seems to have been misspecified, although somehow this didn't affect the AlphaFold team. For me, it only worked after I changed the base CUDA version to 11.2. However, the AlphaFold team has since updated the Dockerfile with a different base image that they believe resolves the 11.0 incompatibility, so I will take a look to see whether it works with 11.0 now.

rivershah commented 1 year ago

Bumping this thread. Driver mismatch does cause precision loss and a number of compilation issues. Are we in a better position to allow users to specify the NVIDIA driver version, please? Certain operations require the latest cuDNN, so lagging base CUDA/cuDNN versions are not a great solution. Currently, with cuDNN 8.6.0, I get the errors below. Downgrading makes them go away, but then I cannot take advantage of certain newly released features.

2022-10-31 09:10:29.339124: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:245] Device: NVIDIA A100-SXM4-40GB
2022-10-31 09:10:29.339135: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:246] Platform: Compute Capability 8.0
2022-10-31 09:10:29.339145: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:247] Driver: 11080 (520.61.5)
2022-10-31 09:10:29.339155: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:248] Runtime: <undefined>
2022-10-31 09:10:29.339172: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:255] cudnn version: 8.6.0
2022-10-31 09:10:29.444706: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 0: 28848 vs 25408
2022-10-31 09:10:29.444765: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 1: 28736 vs 25296
2022-10-31 09:10:29.444777: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 2: 29280 vs 25760
2022-10-31 09:10:29.444786: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 3: 29456 vs 25904
2022-10-31 09:10:29.444801: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 4: 28816 vs 25376
2022-10-31 09:10:29.444817: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 5: 28928 vs 25488
2022-10-31 09:10:29.444826: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 6: 29488 vs 25920
2022-10-31 09:10:29.444837: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 7: 29408 vs 25840
2022-10-31 09:10:29.444846: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 8: 29728 vs 26096
2022-10-31 09:10:29.444855: E tensorflow/compiler/xla/service/gpu/buffer_comparator.cc:731] Difference at 9: 29968 vs 26320
2022-10-31 09:10:29.444945: E tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc:612] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
%cudnn-conv-bw-filter.17 = (f16[3,3,2,16]{2,1,0,3}, u8[0]{0}) custom-call(f16[3200,2,58,64]{1,3,2,0} %bitcast.425, f16[3200,16,58,64]{1,3,2,0} %add.6050), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_01io->bf01, custom_call_target="__cudnn$convBackwardFilter", metadata={op_type="Conv2DBackpropFilter" op_name="gradient_tape/markup_network/conv2d_8/Conv2D/Conv2DBackpropFilter" source_file="/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py" source_line=510}, backend_config="{\"conv_result_scale\":1,\"activation_mode\":\"0\",\"side_input_scale\":0}" for eng39{k12=46,k13=1,k14=0,k15=0,k17=47,k2=11,k6=0,k22=0} vs eng9{k14=4,k23=0,k2=1,k18=1,k13=0}
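
One thing that may help when raising this with the COS/Life Sciences folks (and for ruling out a container-side mismatch) is logging the versions the task actually sees at startup: the kernel driver comes from the COS host, while the CUDA/cuDNN stack comes from your image. A small sketch, assuming a TensorFlow-based image:

# Driver version on the host, as exposed to the container.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA and cuDNN versions TensorFlow in the image was built against.
python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"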
wnojopra commented 1 year ago

Hi @rivershah ,

At this time the Life Sciences API is still limited by what COS supports.

That being said, I do recommend following up on that thread (and perhaps even starting a new post in gcp-life-sciences-discuss). I think it would be valuable for the team to know the specific features you're missing out on due to the lag.

carbocation commented 7 months ago

I'm going to close this issue since I opened it and it is stale.