awslabs / aws-virtual-gpu-device-plugin

AWS virtual gpu device plugin provides capability to use smaller virtual gpus for your machine learning inference workloads
https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/
Apache License 2.0

Autoscaler support #20

Open dempti opened 3 years ago

dempti commented 3 years ago

GPU sharing works perfectly fine, but when I scale pods based on GPU share, cluster-autoscaler fails to scale up instances to meet the demand and logs the following errors:

clusterautoscaler-aws-cluster-autoscaler-6dbcb4d4f7-fv5w7 aws-cluster-autoscaler I0908 02:56:58.534530       1 scale_up.go:288] Pod resnet-deployment-8978c7f89-2469s can't be scheduled on eks-clusterNodegroupclusterdefa-aTwrPbQ2r3sD-60bdce6a-2014-ffda-69e8-b6f67da592f2, predicate checking error: Insufficient k8s.amazonaws.com/vgpu; predicateName=NodeResourcesFit; reasons: Insufficient k8s.amazonaws.com/vgpu; debugInfo=
clusterautoscaler-aws-cluster-autoscaler-6696574c75-zf65d aws-cluster-autoscaler I0908 03:01:29.704188       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"resnet-deployment-8978c7f89-gtnf6", UID:"a147f1be-a9d7-45a0-bb72-cd26a783ef9c", APIVersion:"v1", ResourceVersion:"5616318", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up: 1 Insufficient k8s.amazonaws.com/vgpu
alexpirogovski commented 3 years ago

Works for me. Here are the steps:

  1. Create a node group config file for the eksctl tool, using GPU-capable instances.
  2. Add this label to the config file (in the labels section): k8s.amazonaws.com/accelerator: vgpu
  3. Add two tags in the tags section:
    k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: vgpu
    k8s.io/cluster-autoscaler/node-template/resources/k8s.amazonaws.com/vgpu: "2"
  4. Create the node group with --install-nvidia-plugin=false

The newly created nodes will be properly labeled for the vgpu plugin, and the autoscaler will know that this node group can provide the necessary resources when a pod requests them.

Source (under Scaling from zero): https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html#ca-view-logs
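
For illustration, a minimal sketch of what such an eksctl config file might look like. The label and the two tags are the ones listed in steps 2 and 3; the cluster name, region, node group name, instance type, and size limits are placeholders I am assuming for the example, not values from my setup.

```yaml
# Sketch of an eksctl config for a vgpu node group.
# Everything marked "assumed" is a placeholder, not taken from this thread.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster          # assumed cluster name
  region: us-west-2         # assumed region

nodeGroups:
  - name: vgpu-nodes        # assumed node group name
    instanceType: g4dn.xlarge   # assumed GPU-capable instance type
    minSize: 0              # assumed; 0 exercises the scale-from-zero case
    maxSize: 3              # assumed
    labels:
      # Step 2: label the nodes for the vgpu plugin
      k8s.amazonaws.com/accelerator: vgpu
    tags:
      # Step 3: tell cluster-autoscaler what a node from this group
      # would offer, even when no such node is running yet
      k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: vgpu
      k8s.io/cluster-autoscaler/node-template/resources/k8s.amazonaws.com/vgpu: "2"
```

The value "2" in the resources tag should match the number of k8s.amazonaws.com/vgpu units each node actually advertises; if the two differ, the autoscaler may over- or under-estimate what a new node can fit.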

admiral-srinjoy commented 2 years ago

@alexpirogovski I believe the NVIDIA plugin isn't installed by default and has to be installed separately as a DaemonSet. In that case, is step 4 necessary? I was not able to get it to work with the three other additions you suggested.

dempti commented 2 years ago

@admiral-srinjoy you can follow this issue for the solution: https://github.com/kubernetes/autoscaler/issues/4315

alexpirogovski commented 2 years ago

@admiral-srinjoy AFAIR, the NVIDIA plugin and aws-virtual-gpu-device-plugin are mutually exclusive.

admiral-srinjoy commented 2 years ago

Thanks @dempti this helps