NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Driver Capabilities GKE timesharing #165

Open patrickcorrigan opened 1 year ago

patrickcorrigan commented 1 year ago

I'm running nvidia/cuda:11.8.0-base-ubuntu20.04 on Google Kubernetes Engine using GPU timesharing on T4 GPUs.

When I check the driver capabilities, I get compute and utility. I was hoping to also get graphics and video. Is this a limitation of timesharing on GKE?
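For anyone wanting to reproduce the check, here is a minimal sketch of one way to compare the capabilities a container reports with the ones it needs. It assumes the capabilities are read from the NVIDIA_DRIVER_CAPABILITIES environment variable inside the container; the helper names `parse_capabilities` and `missing_capabilities` are hypothetical, and treating an unset variable as "nothing granted" is a simplification rather than the runtime's documented behaviour.

```python
import os


def parse_capabilities(raw):
    """Split a comma-separated capability string into a set of names."""
    return {item.strip() for item in (raw or "").split(",") if item.strip()}


def missing_capabilities(raw, wanted):
    """Return the wanted capabilities that `raw` does not grant, sorted."""
    granted = parse_capabilities(raw)
    if "all" in granted:
        # "all" covers every capability, so nothing can be missing.
        return []
    return sorted(set(wanted) - granted)


if __name__ == "__main__":
    # Assumption: the value of interest is the container's environment variable.
    raw = os.environ.get("NVIDIA_DRIVER_CAPABILITIES")
    print("NVIDIA_DRIVER_CAPABILITIES =", raw)
    print("missing:", missing_capabilities(raw, ["graphics", "video"]))
```

With the value described above (compute and utility), the helper would report both graphics and video as missing.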

elezar commented 1 year ago

You need to set NVIDIA_DRIVER_CAPABILITIES=all in the container that you are starting. Just a note: if you're using the Google Device plugin, there is no support for NVIDIA_DRIVER_CAPABILITIES at all, and changing it should not affect the libraries / binaries that are included.
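As a rough sketch of what setting the variable could look like on Kubernetes, the snippet below builds a Pod manifest with NVIDIA_DRIVER_CAPABILITIES set on the container and prints it as JSON. The pod name `gpu-test` and the helper `gpu_pod_manifest` are hypothetical, the image is the one mentioned in this issue, and any node selectors needed for GKE timesharing are omitted. As noted above, with the Google Device plugin this variable may have no effect.

```python
import json


def gpu_pod_manifest(name, image, capabilities="all"):
    """Build a minimal Pod manifest that requests one GPU and sets
    NVIDIA_DRIVER_CAPABILITIES on its single container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The variable is set per container, not per pod.
                    "env": [
                        {"name": "NVIDIA_DRIVER_CAPABILITIES", "value": capabilities}
                    ],
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # "gpu-test" is a placeholder name chosen for illustration.
    manifest = gpu_pod_manifest("gpu-test", "nvidia/cuda:11.8.0-base-ubuntu20.04")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped to `kubectl apply -f -`.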

Is there something that you're trying to do in the container that you cannot?

patrickcorrigan commented 1 year ago

Hi @elezar,

Yes, I did use the Google Device plugin to install the drivers on my nodes using the DaemonSet.

I'm currently running containers on a cluster where I start X Window sessions, run graphical applications, capture the display, encode it to H.264 and stream it over WebRTC to users. Unfortunately, I'm doing all of this on the CPU 🥲 which doesn't give great performance for more intensive workloads.

I was hoping I could add some T4 GPU nodes for a boost in rendering performance and use the NVENC encoder for faster H.264 encoding 🚀

I wanted to share the T4 between 4 containers to make it more economical to run 💵

I can get the pods up and running with no problem. I can start an X session using xf86-video-dummy, like I did on CPU-only nodes, but when I run graphical applications they don't seem to use the GPU at all 🐢 I see no processes in nvidia-smi.

So I used nvidia-xconfig to create an X config with a dummy display that specifies the nvidia driver. Unfortunately, I get:

Failed to load module 'nvidia' (Module does not exist, 0)
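That message generally means Xorg could not find the driver file (nvidia_drv.so) in its module path inside the container. A small sketch for checking this is below; the directories in `DEFAULT_MODULE_DIRS` are assumptions based on common Linux layouts, not paths confirmed for this image, and `find_xorg_driver` is a hypothetical helper. The module path Xorg actually searched is listed in its log.

```python
import os

# Assumed search locations; the real module path varies by distribution
# and is reported in the Xorg log.
DEFAULT_MODULE_DIRS = [
    "/usr/lib/xorg/modules",
    "/usr/lib/x86_64-linux-gnu/xorg/extra-modules",
    "/usr/lib64/xorg/modules",
]


def find_xorg_driver(driver="nvidia", module_dirs=DEFAULT_MODULE_DIRS):
    """Return every path under `module_dirs` that holds `<driver>_drv.so`."""
    filename = f"{driver}_drv.so"
    hits = []
    for root in module_dirs:
        # os.walk yields nothing for directories that do not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            if filename in filenames:
                hits.append(os.path.join(dirpath, filename))
    return hits


if __name__ == "__main__":
    found = find_xorg_driver()
    print(found if found else "nvidia_drv.so not found in the assumed module directories")
```

An empty result in the container would be consistent with the error above: the X driver component is not present in the image or mounted into it.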

I was wondering if I needed to install the GPU drivers in the container too, and that's what led me here 👋 to investigate driver capabilities. This was the most isolated question I could think of asking.

TL;DR 📖 Running hardware-accelerated X sessions, capturing the output, encoding it with the NVENC encoder and streaming it over WebRTC.

Thank you for your quick and helpful response 🙏