Open michael-nammi opened 4 months ago
Hi @michael-nammi, Thanks for opening an issue! We will look into it as soon as possible.
@michael-nammi It would be great if you could provide the yaml of each test process deployment, which would speed up our troubleshooting.
Here are the yaml files of the deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2
              nvidia.com/gpumem: 16384
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096
Delete Deployment A
kubectl delete deployment deployment-a
Modify Deployment B
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 2
              nvidia.com/gpumem: 8192
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
        - name: ubuntu-container
          image: ubuntu:18.04
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              nvidia.com/gpu: 1
              nvidia.com/gpumem: 4096
              nvidia.com/gpucores: 120
Environment
Kubernetes version: v1.27.9
HAMi version: v2.3.9
Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources
Encountered a scheduler bug when updating a Deployment so that its resource requirements exceed the available capacity of a Kubernetes cluster with heterogeneous memory and GPU resources.
Steps to reproduce the issue
Expected Behavior
The update should fail because the cluster does not have enough memory or GPUs available to satisfy 3 replicas of Deployment B with the specified resources.
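For reference, the aggregate demand of the modified Deployment B can be worked out with a small sketch. It assumes that `nvidia.com/gpumem` is expressed in MiB and applies to each requested GPU. The `fits` helper and its `free_gpus` / `free_mem_mib` parameters are hypothetical and must be filled in from the real cluster. The actual scheduler decides per node and per device, so this is only a rough upper-level sanity check, not HAMi's logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuRequest:
    """Limits copied from a Deployment's container spec."""
    replicas: int
    gpus_per_pod: int        # nvidia.com/gpu
    gpumem_per_gpu_mib: int  # nvidia.com/gpumem (assumed: MiB per requested GPU)


def total_demand(req: GpuRequest) -> tuple[int, int]:
    """Return (total GPUs, total GPU memory in MiB) across all replicas."""
    total_gpus = req.replicas * req.gpus_per_pod
    total_mem = total_gpus * req.gpumem_per_gpu_mib
    return total_gpus, total_mem


def fits(req: GpuRequest, free_gpus: int, free_mem_mib: int) -> bool:
    """Hypothetical cluster-wide check; free_* values come from the real cluster."""
    gpus, mem = total_demand(req)
    return gpus <= free_gpus and mem <= free_mem_mib


# Modified Deployment B: 3 replicas, 2 GPUs each, 8192 MiB per GPU.
updated_b = GpuRequest(replicas=3, gpus_per_pod=2, gpumem_per_gpu_mib=8192)
print(total_demand(updated_b))  # (6, 49152)
```

Under these assumptions the update asks for 6 GPUs and 49152 MiB of GPU memory in total, which is the figure to compare against what the nodes actually report.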
Actual Behavior
The update fails, but the node resource allocation is incorrectly reported:
Prometheus Metrics
Bug 2: Incorrect GPU Utilization
Encountered a scheduler bug when creating a Deployment whose GPU utilization requirement (`nvidia.com/gpucores: 120`) exceeds the maximum of 100%.
Steps to reproduce the issue
Expected Behavior
The deployment should fail to be scheduled due to the GPU utilization requirement exceeding the maximum limit of 100%.
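The kind of check expected here can be sketched as follows. It assumes that `nvidia.com/gpucores` is a percentage of a single GPU's compute, so any value above 100 is invalid. The function name `gpucores_errors` is hypothetical and this is not HAMi's actual validation code.

```python
MAX_GPUCORES_PERCENT = 100  # assumed upper bound: 100% of one GPU's cores


def gpucores_errors(limits: dict[str, int]) -> list[str]:
    """Return validation errors for the gpucores limit (empty list if valid)."""
    errors: list[str] = []
    cores = limits.get("nvidia.com/gpucores")
    # A missing key means no core limit was requested, which is treated as valid.
    if cores is not None and not 0 <= cores <= MAX_GPUCORES_PERCENT:
        errors.append(
            f"nvidia.com/gpucores={cores} is outside 0..{MAX_GPUCORES_PERCENT}"
        )
    return errors


# Limits from the Deployment used to reproduce Bug 2.
limits = {
    "nvidia.com/gpu": 1,
    "nvidia.com/gpumem": 4096,
    "nvidia.com/gpucores": 120,
}
print(gpucores_errors(limits))
```

With the limits from this report the function returns one error, which is the outcome the scheduler or admission webhook would be expected to produce instead of placing the pod.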
Actual Behavior
The deployment is incorrectly scheduled with the following resource allocation: