aws-samples / eks-workshop

AWS Workshop for Learning EKS
https://eksworkshop.com
MIT No Attribution

Kubeflow MNIST inference deployment CPU/GPU requirements #508

Closed · swoldemi closed this 4 years ago

swoldemi commented 4 years ago

When running the inference pod at https://eksworkshop.com/kubeflow/inference/#run-inference-pod, the deployment uses a GPU image (tensorflow/serving:1.11.1-gpu), but there is no instruction to create the cluster with GPU-enabled instances. Was this an oversight, or did I miss something? The pod also requests 1 CPU (1000 millicpu), and the events show that it cannot be scheduled. (A possible workaround is sketched after the pod description below.)

From my Cloud9 instance:

Admin:~/environment/eksworkshop-eksctl $ kubectl describe pod mnist-inference-85fbc7c86-ws9jf
Name:               mnist-inference-85fbc7c86-ws9jf
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=mnist
                    framework=tensorflow
                    pod-template-hash=85fbc7c86
                    type=inference
                    version=v1
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/mnist-inference-85fbc7c86
Containers:
  mnist:
    Image:       tensorflow/serving:1.11.1-gpu
    Ports:       9000/TCP, 8500/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --rest_api_port=8500
      --model_name=mnist
      --model_base_path=s3://zcpzfd-eks-ml-data/mnist/tf_saved_model
    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Liveness:          tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'AWS_ACCESS_KEY_ID' in secret 'aws-secret'>      Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'aws-secret'>  Optional: false
      AWS_REGION:             us-east-2
      S3_USE_HTTPS:           true
      S3_VERIFY_SSL:          true
      S3_ENDPOINT:            s3.us-east-2.amazonaws.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6hkh6 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-6hkh6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6hkh6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  71s (x4 over 5m29s)  default-scheduler  0/3 nodes are available: 2 Insufficient cpu, 3 Insufficient nvidia.com/gpu.
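
As a workaround until the module is updated, I believe the deployment can be pointed at the CPU image and the GPU resource entries dropped, so the scheduler stops looking for nvidia.com/gpu. A sketch only (the deployment name mnist-inference and container name mnist are taken from the output above; tensorflow/serving:1.11.1 should be the CPU counterpart of the GPU tag):

# Switch the container to the CPU build of TensorFlow Serving
kubectl set image deployment/mnist-inference mnist=tensorflow/serving:1.11.1

# Drop the GPU limit and request (JSON Pointer escapes "/" as "~1")
kubectl patch deployment mnist-inference --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/resources/requests/nvidia.com~1gpu"}
]'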

I've installed Metrics Server on my cluster and verified that I don't have the CPU and memory available (no node can provide 1000 millicpu):

Admin:~/environment/eksworkshop-eksctl $ kubectl top no
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
ip-192-168-14-222.us-east-2.compute.internal   207m         10%    2680Mi          35%       
ip-192-168-40-31.us-east-2.compute.internal    108m         5%     717Mi           9%        
ip-192-168-74-214.us-east-2.compute.internal   171m         8%     1150Mi          15%      

And that there are no GPUs:

Admin:~/environment/eksworkshop-eksctl $ kubectl describe nodes | tr -d '\000' \
  | sed -n -e '/^Name/,/Roles/p' -e '/^Capacity/,/Allocatable/p' -e '/^Allocated resources/,/Events/p' \
  | grep -e Name -e nvidia.com \
  | perl -pe 's/\n//' | perl -pe 's/Name:/\n/g' \
  | sed 's/nvidia.com\/gpu:\?//g' \
  | sed '1s/^/Node Available(GPUs)  Used(GPUs)/' | sed 's/$/ 0 0 0/' \
  | awk '{print $1, $2, $3}' | column -t
Node                                          Available(GPUs)  Used(GPUs)
ip-192-168-14-222.us-east-2.compute.internal  0                0
ip-192-168-40-31.us-east-2.compute.internal   0                0
ip-192-168-74-214.us-east-2.compute.internal  0                0
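
For completeness: kubectl top reports live usage, while the scheduler compares pod requests against each node's allocatable capacity. Something like the following shows the numbers the scheduler actually uses:

# Allocatable capacity per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'

# Resources already requested on each node
kubectl describe nodes | grep -A 7 'Allocated resources'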
dalbhanj commented 4 years ago

Thanks @swoldemi for raising this. I'll add instructions to increase the cluster size to 6 nodes to accommodate the resource needs of the Kubeflow module; in the meantime you can scale the existing nodegroup yourself (see the sketch below). @Jeffwan, can you take a look at switching the inference from GPU-based to CPU-based?
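
A sketch of the scale-out with eksctl (the nodegroup name is a placeholder; list yours first):

# Find the nodegroup name for the workshop cluster
eksctl get nodegroup --cluster eksworkshop-eksctl

# Scale it to 6 nodes
eksctl scale nodegroup --cluster eksworkshop-eksctl --name <nodegroup-name> --nodes 6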

dalbhanj commented 4 years ago

@swoldemi the resource issue is fixed.

swoldemi commented 4 years ago

@dalbhanj Scaling the nodegroup up to 6 nodes still fails because of the instance type: the nodes have no GPUs, so the nvidia.com/gpu request can never be satisfied. Would it be better to create the cluster with a GPU-capable instance type? (One possible approach is sketched after the output below.)

Admin:~/environment/eksworkshop-eksctl $ kubectl describe pod mnist-inference-fbb9dcf5-4hkj6
Name:               mnist-inference-fbb9dcf5-4hkj6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=mnist
                    framework=tensorflow
                    pod-template-hash=fbb9dcf5
                    type=inference
                    version=v1
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 
Controlled By:      ReplicaSet/mnist-inference-fbb9dcf5
Containers:
  mnist:
    Image:       tensorflow/serving
    Ports:       9000/TCP, 8500/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --rest_api_port=8500
      --model_name=mnist
      --model_base_path=s3:///mnist/tf_saved_model
    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Liveness:          tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'AWS_ACCESS_KEY_ID' in secret 'aws-secret'>      Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'aws-secret'>  Optional: false
      AWS_REGION:             us-east-2
      S3_USE_HTTPS:           true
      S3_VERIFY_SSL:          true
      S3_ENDPOINT:            s3.us-east-2.amazonaws.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6hkh6 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-6hkh6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6hkh6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m54s (x52 over 79m)  default-scheduler  0/6 nodes are available: 2 Insufficient cpu, 6 Insufficient nvidia.com/gpu.
Admin:~/environment/eksworkshop-eksctl $ kubectl get no
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-14-222.us-east-2.compute.internal   Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-18-169.us-east-2.compute.internal   Ready    <none>   126m   v1.14.7-eks-1861c5
ip-192-168-36-73.us-east-2.compute.internal    Ready    <none>   126m   v1.14.7-eks-1861c5
ip-192-168-40-31.us-east-2.compute.internal    Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-74-214.us-east-2.compute.internal   Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-89-184.us-east-2.compute.internal   Ready    <none>   126m   v1.14.7-eks-1861c5
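
If the module keeps GPU-based inference, I assume the fix is to add a GPU nodegroup and install the NVIDIA device plugin so nvidia.com/gpu becomes schedulable. A sketch (the nodegroup name, instance type, and device plugin version here are my assumptions, not workshop instructions):

# Add a GPU nodegroup to the existing cluster
eksctl create nodegroup --cluster eksworkshop-eksctl --name gpu-nodes --node-type p2.xlarge --nodes 1

# Install the NVIDIA device plugin DaemonSet (pick the version matching your cluster)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.12/nvidia-device-plugin.yml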
swoldemi commented 4 years ago

My mistake: I hadn't updated my manifest. The Pod is scheduled!