Closed swoldemi closed 4 years ago
Thanks @swoldemi for reporting this issue. I'll add instructions to increase the cluster size to 6 nodes to accommodate the resource requirements of the Kubeflow module. @Jeffwan, can you take a look at switching the inference example from GPU-based to CPU-based?
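For the workshop cluster, that scaling step could look like the eksctl commands below (the cluster name is taken from the prompt in this thread; the nodegroup name is a placeholder, so this is a sketch rather than the workshop's exact instructions):

```shell
# List nodegroups to find the one to scale (name left as a placeholder).
eksctl get nodegroup --cluster=eksworkshop-eksctl

# Scale the chosen nodegroup up to 6 nodes.
eksctl scale nodegroup --cluster=eksworkshop-eksctl --name=<nodegroup-name> --nodes=6
```

Note these commands need a live EKS cluster and valid AWS credentials, so the output cannot be shown here.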
@swoldemi the resource issue is fixed.
@dalbhanj Scaling the nodegroup up to 6 nodes still fails because of the instance type. Would it be better to create the cluster with a different, GPU-enabled instance type?
Admin:~/environment/eksworkshop-eksctl $ kubectl describe pod mnist-inference-fbb9dcf5-4hkj6
Name:               mnist-inference-fbb9dcf5-4hkj6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=mnist
                    framework=tensorflow
                    pod-template-hash=fbb9dcf5
                    type=inference
                    version=v1
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
IP:
Controlled By:      ReplicaSet/mnist-inference-fbb9dcf5
Containers:
  mnist:
    Image:       tensorflow/serving
    Ports:       9000/TCP, 8500/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --rest_api_port=8500
      --model_name=mnist
      --model_base_path=s3:///mnist/tf_saved_model
    Limits:
      cpu:             4
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  1
    Liveness:  tcp-socket :9000 delay=30s timeout=1s period=30s #success=1 #failure=3
    Environment:
      AWS_ACCESS_KEY_ID:      <set to the key 'AWS_ACCESS_KEY_ID' in secret 'aws-secret'>      Optional: false
      AWS_SECRET_ACCESS_KEY:  <set to the key 'AWS_SECRET_ACCESS_KEY' in secret 'aws-secret'>  Optional: false
      AWS_REGION:             us-east-2
      S3_USE_HTTPS:           true
      S3_VERIFY_SSL:          true
      S3_ENDPOINT:            s3.us-east-2.amazonaws.com
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6hkh6 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-6hkh6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6hkh6
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  4m54s (x52 over 79m)  default-scheduler  0/6 nodes are available: 2 Insufficient cpu, 6 Insufficient nvidia.com/gpu.
Admin:~/environment/eksworkshop-eksctl $ kubectl get no
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-14-222.us-east-2.compute.internal   Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-18-169.us-east-2.compute.internal   Ready    <none>   126m   v1.14.7-eks-1861c5
ip-192-168-36-73.us-east-2.compute.internal    Ready    <none>   126m   v1.14.7-eks-1861c5
ip-192-168-40-31.us-east-2.compute.internal    Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-74-214.us-east-2.compute.internal   Ready    <none>   16h    v1.14.7-eks-1861c5
ip-192-168-89-184.us-east-2.compute.internal   Ready    <none>   126m   v1.14.7-eks-1861c5
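If GPU-based inference is kept, one option is adding a dedicated GPU nodegroup rather than scaling the existing CPU one. A hedged sketch of an eksctl config fragment (the nodegroup name, instance type, and capacity below are assumptions, not part of the workshop):

```yaml
# Hypothetical ClusterConfig fragment adding a GPU nodegroup to the workshop cluster.
# Cluster name and region are taken from this thread; everything else is assumed.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eksworkshop-eksctl
  region: us-east-2
nodeGroups:
  - name: gpu-inference        # assumed name
    instanceType: p3.2xlarge   # any GPU instance type that can expose nvidia.com/gpu
    desiredCapacity: 1
```

This would be applied with `eksctl create nodegroup --config-file=<file>`; the NVIDIA device plugin DaemonSet must also be running on those nodes before `nvidia.com/gpu` shows up as an allocatable resource.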
My mistake, I didn't update my manifest. Pod is scheduled!
When running the inference pod at https://eksworkshop.com/kubeflow/inference/#run-inference-pod, the deployment uses a GPU image (tensorflow/serving:1.11.1-gpu), but there were no instructions to create the cluster with GPU-enabled instances; was this an oversight, or did I miss something? The pod also requests 1 CPU (1000 millicpu), and the events show that it cannot be scheduled.
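Assuming the workshop switches to CPU-only inference, the deployment would presumably use the non-GPU serving image and drop the GPU resource entries. A sketch of the relevant container spec (the image tag and resource values are assumptions based on the pod description in this thread):

```yaml
# Hypothetical CPU-only variant of the mnist inference container spec.
containers:
  - name: mnist
    image: tensorflow/serving:1.11.1   # CPU image instead of tensorflow/serving:1.11.1-gpu
    resources:
      limits:
        cpu: "4"
        memory: 4Gi
        # nvidia.com/gpu limit removed so the pod can schedule on CPU-only nodes
      requests:
        cpu: "1"
        memory: 1Gi
```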
From my Cloud9 instance, I've installed Metrics Server on the cluster and verified that I do not have enough CPU and memory available (no node can provide 1000 millicpu), and that there are no GPUs.
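The checks above can also be expressed as kubectl queries (these need the live cluster, so no output can be shown here; the custom-columns paths are standard node status fields):

```shell
# Show allocatable CPU and GPU per node; the GPU column reads <none> on non-GPU nodes.
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,GPU:'.status.allocatable.nvidia\.com/gpu'

# With Metrics Server installed, show current CPU/memory usage per node.
kubectl top nodes
```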