Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Multiple Scheduler Bugs: Deployment Update Resource Allocation and GPU Utilization #303

Open michael-nammi opened 4 months ago

michael-nammi commented 4 months ago

Environment

Kubernetes version: v1.27.9
HAMi version: v2.3.9

Bug 1: Possible Scheduler Bug When Updating Deployment with Insufficient Resources

Encountered a scheduler bug when updating a Deployment's resource requirements beyond the available capacity of a Kubernetes cluster with heterogeneous memory and GPU resources.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU
    • Node 2: 4GiB Memory, 1 GPU
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB)
  2. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 16GiB
    • GPU requirement: 2
  3. Create Deployment B:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
  4. Delete Deployment A
  5. Modify Deployment B
    • Change replicas to 3
    • Change memory requirement to 8GiB
    • Change GPU requirement to 2

      Expected Behavior

      The update should fail because there is not enough memory and GPUs available in the cluster to satisfy the requirements of 3 replicas of Deployment B with the specified resources.

    • Node 1: 4GiB Memory occupied by the pre-existing resources of Deployment B
    • Node 2: Unchanged (idle)
    • Node 3: 2 replicas of Deployment B fully occupy the memory of both GPUs (8GiB on each of the two 16GiB GPUs per replica)
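The capacity arithmetic behind this expectation can be sketched with a small first-fit simulation (a simplification of HAMi's real scheduling logic; it assumes a pod's GPUs must all come from one node and that each assigned GPU needs the full per-GPU memory free):

```python
# Hypothetical first-fit placement check -- NOT HAMi's actual code.
def place(nodes, replicas, gpus_per_pod, mem_per_gpu):
    """Try to place `replicas` pods; return how many fit."""
    placed = 0
    for _ in range(replicas):
        for free in nodes.values():
            # GPUs on this node with enough free memory
            fit = [i for i, mem in enumerate(free) if mem >= mem_per_gpu]
            if len(fit) >= gpus_per_pod:
                for i in fit[:gpus_per_pod]:
                    free[i] -= mem_per_gpu
                placed += 1
                break
    return placed

# Free GPU memory per node, in MiB (node 1's GPU still holds the
# original 4GiB replica of Deployment B).
nodes = {
    "node1": [0],                    # 4GiB GPU, used by the old replica
    "node2": [4 * 1024],             # 4GiB GPU, idle
    "node3": [16 * 1024, 16 * 1024], # two 16GiB GPUs, idle
}

print(place(nodes, 3, 2, 8 * 1024))  # 2 -- the third replica cannot fit
print(nodes["node3"])                # [0, 0] -- both GPUs fully occupied
```

Only node 3 has two GPUs, and each 16GiB GPU can absorb exactly two 8GiB slices, so two replicas exhaust both GPUs and the third must stay pending.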

      Actual Behavior

      The update fails, but the node resource allocation is incorrectly reported:

    • Node 1: 4GiB Memory
    • Node 2: Unchanged (idle)
    • Node 3: Resources are reported as 8GiB and 12GiB, which is inconsistent with the expected result of both GPUs having their full 16GiB allocated

      Prometheus Metrics

      (screenshot of Prometheus metrics showing the reported allocations)

Bug 2: Incorrect GPU Utilization

Encountered a scheduler bug when creating a Deployment whose GPU-core requirement exceeds the maximum utilization of 100% in a Kubernetes cluster with heterogeneous memory and GPU resources.

Steps to reproduce the issue

  1. Pre-conditions:
    • Node 1: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 2: 4GiB Memory, 1 GPU (Max Utilization: 100%)
    • Node 3: 16GiB Memory, 2 GPUs (each GPU with 16GiB Memory and Max Utilization: 100%)
  2. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
    • GPUcores requirement: 120 (more than 100% GPU utilization, taking 100 as the maximum)

      Expected Behavior

      The deployment should fail to be scheduled due to the GPU utilization requirement exceeding the maximum limit of 100%.

    • Node 1: Memory should remain unallocated (4GiB)
    • Node 2: Memory should remain unallocated (4GiB)
    • Node 3: Both GPUs should remain unallocated (16GiB + 100%, and 16GiB + 100%)
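The admission check the reporter expects can be sketched as follows (assuming, as the issue does, that 100 gpucores equals 100% of one physical GPU; `validate_gpucores` is a hypothetical helper, not HAMi's actual code):

```python
MAX_GPU_CORES = 100  # 100 gpucores == 100% utilization of one GPU

def validate_gpucores(requested: int) -> bool:
    """A pod should only be schedulable if its per-GPU core request
    does not exceed one full GPU's share."""
    return 0 < requested <= MAX_GPU_CORES

print(validate_gpucores(100))  # True  -- a full GPU is allowed
print(validate_gpucores(120))  # False -- should be rejected, not scheduled
```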

      Actual Behavior

      The deployment is incorrectly scheduled with the following resource allocation:

    • Node 1: Unchanged (4GiB Memory idle)
    • Node 2: Unchanged (4GiB Memory idle)
    • Node 3: Resources are reported incorrectly:
      • First GPU: Appears as if 4GiB Memory + 100% Utilization has been allocated to Deployment A (there should be no allocation at all)
      • Second GPU: Unallocated (16GiB Memory and 100% Utilization idle)
github-actions[bot] commented 4 months ago

Hi @michael-nammi, Thanks for opening an issue! We will look into it as soon as possible.

wawa0210 commented 4 months ago

@michael-nammi It would be great if you could provide the YAML for each deployment used in the tests; that would speed up our troubleshooting.

michael-nammi commented 4 months ago

Here are the YAML files for the deployments:

Test for bug 1

Steps:
  1. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 16GiB
    • GPU requirement: 2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 2
            nvidia.com/gpumem: 16384
  2. Create Deployment B:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 1
            nvidia.com/gpumem: 4096
  3. Delete Deployment A

    • kubectl delete deployment deployment-a
  4. Modify Deployment B

    • Change replicas to 3
    • Change memory requirement to 8GiB
    • Change GPU requirement to 2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-b
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 2
            nvidia.com/gpumem: 8192

Test for bug 2

Steps:
  1. Create Deployment A:
    • Replicas: 1
    • Memory requirement: 4GiB
    • GPU requirement: 1
    • GPUcores requirement: 120
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - name: ubuntu-container
        image: ubuntu:18.04
        command: ["bash", "-c", "sleep 86400"]
        resources:
          limits:
            nvidia.com/gpu: 1
            nvidia.com/gpumem: 4096
            nvidia.com/gpucores: 120