Project-HAMi / HAMi-core

HAMi-core builds libvgpu.so, which enforces hard resource limits on GPUs inside containers

The initialization of the variable 'region_info.shared_region->procs[i].hostpid' in 'multiprocess_memory_limit.c' is incorrect #15

Closed z19311 closed 2 weeks ago

z19311 commented 3 weeks ago

multiprocess_memory_limit.c

shrreg_proc_slot_t *find_proc_by_hostpid(int hostpid) {
    int i;
    for (i=0;i<region_info.shared_region->proc_num;i++) {
        LOG_INFO("hostpid=%d containerpid=%d procs[i].hostpid=%d", hostpid, region_info.shared_region->procs[i], region_info.shared_region->procs[i].hostpid);
        // The output is: hostpid=20382 containerpid=12 procs[i].hostpid=0
        if (region_info.shared_region->procs[i].hostpid == hostpid) 
            return &region_info.shared_region->procs[i];
    }
    return NULL;
}

int set_host_pid(int hostpid) {
    int i,j,found=0;
    for (i=0;i<region_info.shared_region->proc_num;i++){
    LOG_INFO("set_host_pid: region_info.shared_region->procs[i].pid=%d getpid=%d region_info.shared_region->procs[i].hostpid=%d\n", region_info.shared_region->procs[i].pid, getpid(), region_info.shared_region->procs[i].hostpid);
        //  The output is: set_host_pid: region_info.shared_region->procs[i].pid=12 getpid=12 region_info.shared_region->procs[i].hostpid=0
        if (region_info.shared_region->procs[i].pid == getpid()){
            LOG_INFO("SET PID= %d",hostpid);
            //  The output is: SET PID= 12
            found=1;
            region_info.shared_region->procs[i].hostpid = hostpid;
            LOG_INFO("After setting, hostpid for process %d is %d", region_info.shared_region->procs[i].pid, region_info.shared_region->procs[i].hostpid);
            //  The output is: After setting, hostpid for process 12 is 12
            for (j=0;j<CUDA_DEVICE_MAX_COUNT;j++)
                region_info.shared_region->procs[i].monitorused[j]=0;
        }
    }
    if (!found) {
        LOG_ERROR("HOST PID NOT FOUND. %d",hostpid);
        return -1;
    }
    setspec();
    return 0;
}
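
Given the "SET PID= 12" output above, set_host_pid() appears to receive the container pid (12) as its hostpid argument rather than the host pid 20382. A minimal guard one could add while debugging (a hypothetical sketch, not HAMi-core code; it assumes hostPID is disabled, so a genuine host pid differs from the in-container getpid()):

    /* Hypothetical debugging guard, placed at the top of set_host_pid():
     * with hostPID disabled, a real host pid should differ from getpid(). */
    if (hostpid == getpid()) {
        LOG_INFO("suspicious: set_host_pid got hostpid==getpid()==%d, which looks like a container pid", hostpid);
    }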

In the testing environment, only one process is using the GPU.

In find_proc_by_hostpid(), the LOG_INFO output is hostpid=20382 containerpid=12 procs[i].hostpid=0. The condition region_info.shared_region->procs[i].hostpid == hostpid never matches, because region_info.shared_region->procs[i].hostpid is always 0; the hostpid=20382 and containerpid=12 values themselves are correct.

set_host_pid() may not be correct, either.

Maybe the initialization of region_info.shared_region->procs[i].hostpid is incorrect. Where is the code that sets this variable?

Does anyone know what causes this issue?
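
As a side note, the host/container pid pair can be cross-checked from the host side via the NSpid line in /proc/<pid>/status (a minimal sketch; print_nspid is a hypothetical helper, not part of HAMi-core):

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read /proc/<pid>/status on the HOST; for a process in
 * a pid namespace, NSpid lists "hostpid containerpid", e.g. "20382 12". */
int print_nspid(int host_pid) {
    char path[64], line[256];
    snprintf(path, sizeof(path), "/proc/%d/status", host_pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "NSpid:", 6) == 0) {
            fputs(line, stdout);   /* expect "NSpid:\t20382\t12" for this issue */
            break;
        }
    }
    fclose(f);
    return 0;
}

int main(void) { return print_nspid(20382); }  /* host pid from the logs above */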

z19311 commented 2 weeks ago

I changed the NVIDIA driver from 550 to 535.161.07. I cannot install version 510.108.03 or 535.104.12 on my 4090, which the maintainer said may work.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-util-2
spec:
#  hostPID: true
  restartPolicy: OnFailure
  containers:
    - name: ubuntu-container
      image: hamicore-mnist-cuda12.2-ubuntu20.04-python3.10-lsof-lldb:0823
      command: ["python3", "/libvgpu/mnist.py"]
      resources:
        limits:
          nvidia.com/gpu: 1
          nvidia.com/gpumem: 3000 # this is enough
          nvidia.com/gpucores: 2 # the task actually needs about 5% GPU utilization, so it will pend, as expected

After installing the driver, the first time I run the pod it pends as follows, which is correct:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 Insufficient nvidia.com/gpucores, 1 Insufficient nvidia.com/gpumem.
  Warning  FailedScheduling  27s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 Insufficient nvidia.com/gpucores, 1 Insufficient nvidia.com/gpumem.

But after deleting it and running it again, the pod is running:

NAME         READY   STATUS    RESTARTS   AGE
gpu-util-2   1/1     Running   0          62s

The log statement here prints a wrong container pid, because &region_info.shared_region->procs[i] (a pointer) is passed where %d expects an int:

shrreg_proc_slot_t *find_proc_by_hostpid(int hostpid) {
    int i;
    for (i=0;i<region_info.shared_region->proc_num;i++) {
        LOG_INFO("hostpid=%d containerpid=%d procs[i].hostpid=%d", hostpid, &region_info.shared_region->procs[i], region_info.shared_region->procs[i].hostpid);
        // &procs[i] is a pointer, so %d prints garbage here
        if (region_info.shared_region->procs[i].hostpid == hostpid) 
            return &region_info.shared_region->procs[i];
    }
    return NULL;
}
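
For reference, a corrected call (assuming the slot's container pid lives in the .pid field, as set_host_pid() above suggests):

        LOG_INFO("hostpid=%d containerpid=%d procs[i].hostpid=%d",
                 hostpid,
                 region_info.shared_region->procs[i].pid,
                 region_info.shared_region->procs[i].hostpid);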

The log is:

[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880

It seems region_info.shared_region->proc_num and the hostpid argument are both wrong: hostpid comes in as 0, and the loop prints the same slot values five times.
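
If it helps with debugging, here is a small sketch that dumps every slot in the shared region (the field names proc_num, procs[], .pid, and .hostpid are taken from the code quoted above; the helper itself is hypothetical, not HAMi-core API):

/* Hypothetical debugging helper: print each slot's pid/hostpid so stale or
 * mismatched entries in the shared region are visible at a glance. */
void dump_proc_slots(void) {
    int i;
    LOG_INFO("proc_num=%d", region_info.shared_region->proc_num);
    for (i = 0; i < region_info.shared_region->proc_num; i++) {
        LOG_INFO("slot %d: pid=%d hostpid=%d", i,
                 region_info.shared_region->procs[i].pid,
                 region_info.shared_region->procs[i].hostpid);
    }
}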