GPU accelerated hypervisor for containers: nvidia-docker

GrabbenD commented 4 years ago

We are interested in running Clear Linux as a GPU accelerated hypervisor on AWS' EC2 service in order to provide our containers with GPU HVM, which we'd use for AI in TensorFlow. To achieve this we need the nvidia-docker package. Is this something that you'd be able to bundle or provide documentation for?

nvidia-docker: GitHub (source), Documentation (wiki)

Software requirements

~GNU/Linux x86_64 with kernel version > 3.10~ (kernel-native is currently at 5)
Docker >= 19.03 (requested here: #1313)
~NVIDIA drivers~ (available through the docs; you might need to adjust the installer parameters, e.g. remove: --no-nvidia-modprobe)

Thanks

bryteise commented 4 years ago

This looks like it just adds a command-line flag and a config file. I don't really see much point in a wrapper script whose only job is to add the command-line flag. The config file's settings can also be passed as a command-line flag, or you can add your own default config, since it is something we'd expect to be configured by the end user.
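For context, the config file in question registers an extra runtime with the Docker daemon. The following is a minimal sketch of what that file contains, assuming the runtime binary is named `nvidia-container-runtime` and is resolvable on `PATH`, and that the daemon reads `/etc/docker/daemon.json`. The helper name `nvidia_daemon_config` is hypothetical and only used for illustration.

```python
import json


def nvidia_daemon_config(runtime_path="nvidia-container-runtime", make_default=False):
    """Build a daemon.json-style dict that registers an "nvidia" runtime."""
    config = {
        "runtimes": {
            "nvidia": {
                # Assumption: the binary is installed and resolvable on PATH.
                "path": runtime_path,
                "runtimeArgs": [],
            }
        }
    }
    if make_default:
        # Optional: use this runtime for every container unless overridden.
        config["default-runtime"] = "nvidia"
    return config


# Print the JSON an administrator might place in /etc/docker/daemon.json.
print(json.dumps(nvidia_daemon_config(), indent=4))
```

With `make_default=False` the runtime is only registered, so `runc` stays the default and the NVIDIA runtime has to be requested per container.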

GrabbenD commented 4 years ago

You're correct @bryteise, it's possible to install this by hand, but it would help a ton if this were part of the official Clear Linux repository for easier orchestration.

bryteise commented 4 years ago

So I was confused as to what provides the nvidia-container-runtime that is being enabled in the config file. It is natively supported in Docker as of v19.03. Unfortunately Clear Linux doesn't use that release; we are on the LTS version instead.

bryteise commented 4 years ago

I'm still not really in favor of adding the nvidia-docker script though. The expectation for having a script that would pass --runtime for each ecosystem isn't especially appealing, especially when that runtime may not even exist.

Looking at the kata integration work we've done, it may set the default runtime to use the kata runtime. But first it looks for hardware support to do so.

This has given us a few problems and I am considering removing that integration. The different runtimes would be enabled as part of the docker service but runc would be the default.

Having packages to set the default runtime for docker seems like the wrong thing for us to be doing at this point.

bryteise commented 4 years ago

Would enabling the runtime to be selected by default be a reasonable compromise (when we update to a docker version supporting the feature)?

GrabbenD commented 4 years ago

> So I was confused as to what is providing the nvidia-container-runtime that is being enabled in the config file. It is natively supported in the docker as of v19.03. Unfortunately Clear Linux doesn't use that release instead we are on the lts version.

That's very unfortunate, and I hope you'll reconsider or offer the non-LTS version as a standalone bundle, since Docker stable 19.03 provides native GPU passthrough among other new features that will greatly benefit container performance, which is essential for AI workloads.
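As a rough illustration of the native support in 19.03: GPUs are requested with the `--gpus` flag on `docker run`, with no wrapper script involved. The sketch below only assembles such a command line. The helper `gpu_run_command` and the image name `example/tensorflow-gpu` are hypothetical, and actually running the command assumes Docker >= 19.03 plus the NVIDIA driver and container toolkit on the host.

```python
def gpu_run_command(image, command=(), gpus="all"):
    """Assemble a `docker run` invocation that requests GPUs via --gpus."""
    return ["docker", "run", "--rm", "--gpus", gpus, image, *command]


# Hypothetical image name; nvidia-smi is a common way to check GPU visibility.
cmd = gpu_run_command("example/tensorflow-gpu", ["nvidia-smi"])
print(" ".join(cmd))
```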

> The expectation for having a script that would pass --runtime for each ecosystem isn't especially appealing, especially when that runtime may not even exist.

@bryteise That's true only if you execute the nvidia-docker binary; the standard docker executable is independent and doesn't pass this parameter. This means that if you'd like to use plain docker you need to pass the --runtime=nvidia parameter yourself; otherwise you can use NVIDIA's nvidia-docker binary, which does exactly that automatically.
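To make the difference concrete, here is a simplified sketch of what such a wrapper does to the arguments before handing them to docker. It is not the actual nvidia-docker script: the function name `add_nvidia_runtime` is hypothetical, and the sketch assumes the first `run` or `create` argument is the subcommand (it does not parse global options that take values).

```python
def add_nvidia_runtime(docker_args):
    """Return docker arguments with --runtime=nvidia added after run/create."""
    args = list(docker_args)
    for i, arg in enumerate(args):
        if arg in ("run", "create"):
            # Insert the flag right after the subcommand.
            return args[: i + 1] + ["--runtime=nvidia"] + args[i + 1 :]
    # Other subcommands (ps, images, ...) are passed through untouched.
    return args


print(add_nvidia_runtime(["run", "--rm", "some-image"]))
print(add_nvidia_runtime(["ps", "-a"]))
```

Calling plain docker is then equivalent to skipping this step and typing the flag by hand.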

clearlinux / distribution

GPU accelerated hypervisor for containers: nvidia-docker #1314