Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Array Out-of-Bounds Error When GPU Containers Are Placed Between Non-GPU Containers in a Multi-Container Pod #571

Closed Nimbus318 closed 3 weeks ago

Nimbus318 commented 4 weeks ago

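Problem description

My real workload Pod is fairly complex, but it can be reduced to the Pod spec below, which reproduces the same bug. I tested several variations, and they all come down to the case in the title: in a multi-container Pod, a GPU container is placed between non-GPU containers.

The following is the minimal Pod spec that reproduces the problem: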

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container-1
      image: ubuntu:20.04
      command: ["bash", "-c", "sleep 86400"]
    - name: ubuntu-container-2
      image: ubuntu:20.04
      command: [ "bash", "-c", "sleep 86400" ]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 3000
          nvidia.com/gpucores: 30
    - name: ubuntu-container-3
      image: ubuntu:20.04
      command: [ "bash", "-c", "sleep 86400" ]

Whenever the condition described above is met, the Pod goes into the Pending state, and its Event message is:

Post "https://127.0.0.1:443/filter": stream error: stream ID 1; INTERNAL_ERROR; received from peer

At the same time, an index-out-of-range panic appears in the hami-scheduler log:

panic serving 127.0.0.1:60628: runtime error: index out of range [2] with length 2

The stack trace points to this line: score.Devices[idx][ctrid] = append(score.Devices[idx][ctrid], util.ContainerDevice{})

The relevant code is:

if sums == 0 {
    for idx := range score.Devices {
        if len(score.Devices[idx]) <= ctrid {
            score.Devices[idx] = append(score.Devices[idx], util.ContainerDevices{})
        }
        score.Devices[idx][ctrid] = append(score.Devices[idx][ctrid], util.ContainerDevice{})
        continue
    }
}

Problem analysis

ctrid exceeds the length of score.Devices[idx], which causes the index-out-of-range panic.

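Judging from the current code logic: the if guard appends at most one util.ContainerDevices{} to score.Devices[idx] before indexing it with ctrid.

The problem is: when score.Devices[idx] is more than one entry short of ctrid + 1, a single append is not enough. The panic message (index 2, length 2) matches this: the slice had one entry when ctrid was 2, the guard grew it to two, and index 2 was still out of range.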
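The failure mode can be reproduced outside the scheduler. The following is a minimal sketch, not HAMi's code: the types ContainerDevice, ContainerDevices and PodSingleDevice and the helper padWithIf are hypothetical stand-ins that only mirror the shape of the snippet above. It assumes a per-device slice that holds one entry when the third container (ctrid = 2) is processed, which is what the panic message implies.

```go
package main

import "fmt"

// Hypothetical stand-ins for the util types used in the issue;
// the real HAMi definitions carry more fields.
type ContainerDevice struct{}
type ContainerDevices []ContainerDevice
type PodSingleDevice []ContainerDevices

// padWithIf mirrors the original `if` guard: it appends at most one
// empty slot before indexing devs[ctrid]. The panic is recovered and
// returned as an error so the example can report it.
func padWithIf(devs PodSingleDevice, ctrid int) (out PodSingleDevice, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	if len(devs) <= ctrid {
		devs = append(devs, ContainerDevices{})
	}
	devs[ctrid] = append(devs[ctrid], ContainerDevice{})
	return devs, nil
}

func main() {
	// Assumed state: only one slot exists when container index 2 arrives.
	devs := PodSingleDevice{ContainerDevices{ContainerDevice{}}}
	_, err := padWithIf(devs, 2)
	fmt.Println(err) // a non-nil error wrapping the index-out-of-range panic
}
```

With ctrid = 1 and the same starting slice, the single append is enough and no error is returned, which is why Pods whose containers are ordered differently do not hit the panic.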

Solution

To avoid the out-of-range access, score.Devices[idx] must be long enough to hold index ctrid before appending to score.Devices[idx][ctrid].

The modified code is:

if sums == 0 {
    for idx := range score.Devices {
        // if len(score.Devices[idx]) <= ctrid {   if -> for
        // Use a for loop to ensure the length is sufficient
        for len(score.Devices[idx]) <= ctrid {
            score.Devices[idx] = append(score.Devices[idx], util.ContainerDevices{})
        }
        score.Devices[idx][ctrid] = append(score.Devices[idx][ctrid], util.ContainerDevice{})
        continue
    }
}
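The effect of switching from if to for can be checked in isolation. This is a small sketch under the same assumptions as the issue's snippet: ContainerDevice and ContainerDevices are hypothetical stand-ins for the util types, and ensureSlot is an illustrative helper name that does not exist in HAMi.

```go
package main

import "fmt"

// Hypothetical stand-ins for util.ContainerDevice and util.ContainerDevices.
type ContainerDevice struct{}
type ContainerDevices []ContainerDevice

// ensureSlot pads devs with empty entries until index ctrid is valid,
// then appends a placeholder device for that container.
func ensureSlot(devs []ContainerDevices, ctrid int) []ContainerDevices {
	for len(devs) <= ctrid {
		devs = append(devs, ContainerDevices{})
	}
	devs[ctrid] = append(devs[ctrid], ContainerDevice{})
	return devs
}

func main() {
	// Start with a single slot, then process container index 2.
	devs := []ContainerDevices{{ContainerDevice{}}}
	devs = ensureSlot(devs, 2)
	fmt.Println(len(devs), len(devs[1]), len(devs[2])) // prints "3 0 1"
}
```

The loop fills the skipped index 1 with an empty ContainerDevices, so container indices stay aligned with slice positions regardless of how many slots were missing.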
Nimbus318 commented 4 weeks ago

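In PR #572 I added a unit test that covers the case where a GPU container sits between non-GPU containers. The test can also be used to reproduce this bug. Note that if you run only this test, you need to add the device.InitDevices() call: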

func Test_calcScore(t *testing.T) {
    /*
        Uncomment this line if you're running this single test.
        If you're running `make test`, keep this commented out, as there's another test
        (pkg/k8sutil/pod_test.go) that may cause a DATA RACE when calling device.InitDevices().
    */
    // device.InitDevices()

        ...
}