4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Segmentation fault (core dumped) #14

Closed bingMillion closed 2 years ago

bingMillion commented 2 years ago

1. Issue or feature description

When I run the provided example, I get "Segmentation fault (core dumped)". The card is an NVIDIA Corporation GP104GL [Tesla P4] (rev a1).

2. Steps to reproduce the issue

1. Modify https://raw.githubusercontent.com/4paradigm/k8s-device-plugin/master/nvidia-device-plugin.yml, setting "--device-split-count=3", "--device-memory-scaling=1", "--device-cores-scaling=1"
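For reference, those three flags go into the plugin container's args in that DaemonSet yaml. A sketch of the relevant fragment (surrounding fields elided; the image name and layout are assumed, not copied from the actual file):

```yaml
      containers:
        - image: 4pdosc/k8s-device-plugin:latest  # image name is an assumption
          args:
            - "--device-split-count=3"
            - "--device-memory-scaling=1"
            - "--device-cores-scaling=1"
```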

2. kubectl apply -f nvidia-device-plugin.yml

3. Deploy the pod (spec truncated as posted):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
```
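Since the pod spec as posted is truncated at `containers:`, here is a minimal complete example of what such a pod might look like, assuming the plugin exposes the `nvidia.com/gpu` resource name as in the upstream NVIDIA device plugin (container name, image, and the 2-vGPU request are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.2-base
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 2  # request 2 vGPUs from the plugin
```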

```
[root@node1 4p]# kubectl exec -it gpu-pod /bin/sh
# nvidia-smi
Segmentation fault (core dumped)
```

Does the GPU need additional configuration, or is this caused by the operating system itself being CentOS 7.6?

3. Workaround attempt

I don't know whether there is a deeper cause, for example a problem in the .so file when handling one pod allocated 2 vGPUs, but the following change at the device-plugin level resolves it:

```go
for i, vd := range vdevices {
    if i != 0 { // added: iterate only once
        break
    }

    limitKey := fmt.Sprintf("CUDA_DEVICE_MEMORY_LIMIT_%v", i)
    // added: give the first vdevice its memory multiplied by len(vdevices)
    response.Envs[limitKey] = fmt.Sprintf("%vm", vd.memory*uint64(len(vdevices)))
    mapEnvs = append(mapEnvs, fmt.Sprintf("%v:%v", i, vd.dev.ID))
}
// added: also multiply by the number of vGPUs to allocate here
response.Envs["CUDA_DEVICE_SM_LIMIT"] =
    strconv.Itoa(int(100*global.DeviceCoresScalingFlag/float64(global.DeviceSplitCountFlag)) * len(vdevices))
```