Open mjuric opened 4 years ago
@bsipocz @stevenstetzler Add any missing issues and/or solution information here so we don't forget them.
GPU AMIs: https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html
Nvidia device plugin: https://github.com/NVIDIA/k8s-device-plugin
Make sure we add GPU instance support for AWS deployments. This is a tracker issue for the various pieces of this problem, based on our experience with the astroML demo prep.
Todo:
- [ ] Start the GPU nodes with a recommended AMI
- [ ] Patch the k8s deployment so EKS recognizes the GPU nodes (problem may have gone away by now)
- [ ] Deploy nvidia-device-plugin into the k8s cluster (helm chart)
- [ ] Start containers with the `NVIDIA_DRIVER_CAPABILITIES: "all"` environment variable
- [ ] Write a small script/utility to verify everything has been set up correctly and is working.

Add anything that's missing.
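
For the last item, here is a rough sketch of what the verification utility could look like. It is only a starting point and not tested against our cluster. It assumes `kubectl` is on the PATH and already pointed at the EKS cluster, and that the device plugin advertises GPUs under the standard `nvidia.com/gpu` resource name. The function names (`gpu_allocatable`, `smoke_test_pod`, `fetch_nodes`) and the pod name are made up for illustration, and the container image is deliberately left as a parameter since we haven't picked one.

```python
import json
import subprocess

# Resource name the NVIDIA device plugin registers with the kubelet.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable(node_list):
    """Map node name -> allocatable GPU count.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    Nodes that do not advertise any GPUs are reported with a count of 0.
    """
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def smoke_test_pod(image, name="gpu-smoke-test"):
    """Build a pod manifest that requests one GPU and runs `nvidia-smi`.

    Assumption: `image` is a CUDA-capable image in which `nvidia-smi`
    is available once the NVIDIA runtime injects the driver utilities.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # Same setting as in the todo item above.
                    "env": [
                        {"name": "NVIDIA_DRIVER_CAPABILITIES", "value": "all"}
                    ],
                    "resources": {"limits": {GPU_RESOURCE: 1}},
                }
            ],
        },
    }


def fetch_nodes():
    """Fetch the node list from the cluster via kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Something like `gpu_allocatable(fetch_nodes())` should show a non-zero count for every GPU node once the AMI and device plugin are in place. The manifest from `smoke_test_pod(...)` can be dumped with `json.dumps` and piped to `kubectl apply -f -`; if the pod's logs show the `nvidia-smi` table, the whole chain is working end to end.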