aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[EKS] [request]: Document the version of Nvidia drivers shipped with GPU-optimized images #955

Open josegonzalez opened 4 years ago

josegonzalez commented 4 years ago


Tell us about your request

It would be great if you could document the version of the NVIDIA drivers shipped with the GPU-optimized images. Browsing here gives me no real clue as to what they might be, which makes it more difficult to support folks writing CUDA apps.

For those who aren't aware, the CUDA version you can run is tied to the NVIDIA driver version.

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

To figure this out now, I need to provision a given image to EC2 and run nvidia-smi. This is automatable, but annoying and expensive. Additionally, the underlying image AWS ships can change over time, meaning this must be done on a regular basis.
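The manual check described above could be scripted roughly as follows. This is a minimal sketch, assuming it runs on an instance already launched from the GPU-optimized image with `nvidia-smi` on the PATH; the function names `parse_driver_versions` and `query_driver_versions` are hypothetical and only for illustration.

```python
import subprocess
from typing import List


def parse_driver_versions(output: str) -> List[str]:
    """Return one driver version string per non-empty line of nvidia-smi CSV output."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def query_driver_versions() -> List[str]:
    """Run nvidia-smi on the current host and return the driver version reported for each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi exits non-zero
    )
    return parse_driver_versions(result.stdout)
```

Calling `query_driver_versions()` on the instance returns a list with one entry per GPU. Parsing is kept separate from the subprocess call so the parsing step can be exercised without GPU hardware.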

Are you currently working around this issue?

Currently I'm going to eat some ice cream - drumsticks! - but will likely do my suggestion above to suss out what we can/cannot support out of the box (and then work backwards to get the version that supports CUDA 10.1).

Additional context

Y'all are pretty great!

Attachments

Not related to this issue, but in case you needed something to brighten your day, here is a pic of my cat sunbathing.

[Image: 2014-08-31 14 29 49]

mikestef9 commented 4 years ago

Hey @josegonzalez, we document the driver version on this page

https://docs.aws.amazon.com/eks/latest/userguide/eks-linux-ami-versions.html

Is this what you are looking for?

josegonzalez commented 4 years ago

Ah, that's great! Would it make sense to update the docs for the "Example GPU manifest" to reference the supported CUDA version? I believe that would be 10.1 based on the NVIDIA driver version, but currently the docs show 9.2 usage.

bryantbiggs commented 2 months ago

CUDA compatibility is documented here: https://docs.nvidia.com/deploy/cuda-compatibility/. Using the NVIDIA data center driver version that we supply in the EKS AMI release notes, you can cross-reference that page to find the compatible CUDA versions (the CUDA version itself is supplied in your container).
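The cross-referencing step can be sketched in a few lines of Python. This is an illustrative sketch only: the `MIN_DRIVER` table is a small hand-copied subset of the minimum Linux driver versions from NVIDIA's compatibility documentation and should be verified against the page linked above before use. The helper names and the driver string used in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: illustrative subset of minimum Linux driver versions per CUDA
# toolkit release; verify against NVIDIA's compatibility documentation.
MIN_DRIVER: Dict[str, str] = {
    "9.2": "396.26",
    "10.0": "410.48",
    "10.1": "418.39",
    "10.2": "440.33",
}


def as_tuple(version: str) -> Tuple[int, ...]:
    """Convert a dotted version such as '418.87.00' into (418, 87, 0) for comparison."""
    return tuple(int(part) for part in version.split("."))


def supported_cuda_versions(driver_version: str) -> List[str]:
    """List the CUDA releases in MIN_DRIVER whose minimum driver requirement is met."""
    driver = as_tuple(driver_version)
    return [
        cuda
        for cuda, minimum in MIN_DRIVER.items()
        if driver >= as_tuple(minimum)
    ]
```

For a hypothetical driver string such as `"418.87.00"`, `supported_cuda_versions` returns `["9.2", "10.0", "10.1"]`, since that driver is older than the 440.33 minimum listed for CUDA 10.2.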

libcuda.so (see figure 1 from the link above) is installed on the EKS optimized GPU AMI as part of the NVIDIA driver installation. The version of CUDA that users are typically interested in is the version within their container image that is used by their application. Some application frameworks like PyTorch will provide the CUDA libraries they require either when installing via pip or when using the PyTorch supplied container images (ref 1, ref 2)