Closed z19311 closed 2 weeks ago
I changed the NVIDIA driver from 550 to 535.161.07, and I cannot install version 510.108.03 or 535.104.12 (which the maintainer said may work) on my 4090.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-util-2
spec:
  # hostPID: true
  restartPolicy: OnFailure
  containers:
  - name: ubuntu-container
    image: hamicore-mnist-cuda12.2-ubuntu20.04-python3.10-lsof-lldb:0823
    command: ["python3", "/libvgpu/mnist.py"]
    resources:
      limits:
        nvidia.com/gpu: 1
        nvidia.com/gpumem: 3000  # this is enough
        nvidia.com/gpucores: 2   # the task actually needs 5% GPU utilization, so it pends as expected
```
After installing the driver, the first time I run the pod it stays Pending, which is correct:

```
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---- ----               -------
  Warning  FailedScheduling  28s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 Insufficient nvidia.com/gpucores, 1 Insufficient nvidia.com/gpumem.
  Warning  FailedScheduling  27s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 Insufficient nvidia.com/gpucores, 1 Insufficient nvidia.com/gpumem.
```
But after deleting the pod and running it again, it ends up Running:

```
NAME         READY   STATUS    RESTARTS   AGE
gpu-util-2   1/1     Running   0          62s
```
The container pid in the log below is wrong.
```c
shrreg_proc_slot_t *find_proc_by_hostpid(int hostpid) {
    int i;
    for (i = 0; i < region_info.shared_region->proc_num; i++) {
        // Note: the second %d receives &region_info.shared_region->procs[i],
        // a struct pointer rather than a pid, so "containerpid" prints garbage.
        LOG_INFO("hostpid=%d containerpid=%d procs[i].hostpid=%d",
                 hostpid,
                 &region_info.shared_region->procs[i],
                 region_info.shared_region->procs[i].hostpid);
        if (region_info.shared_region->procs[i].hostpid == hostpid)
            return &region_info.shared_region->procs[i];
    }
    return NULL;
}
```
The log output is:

```
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
[HAMI-core Info(13:140644689917696:multiprocess_memory_limit.c:907)]: hostpid=0 containerpid=-1578666184 procs[i].hostpid=4880
```
It seems `region_info.shared_region->proc_num` and `hostpid` are both wrong.
multiprocess_memory_limit.c
In the testing environment, only one process is utilizing the GPU. In `find_proc_by_hostpid()`, the LOG_INFO output is `hostpid=20382 containerpid=12 procs[i].hostpid=0`, and the condition `region_info.shared_region->procs[i].hostpid == hostpid` never holds: `region_info.shared_region->procs[i].hostpid` is always 0, while `hostpid=20382 containerpid=12` is correct. `set_host_pid()` may not be correct either. Maybe the initialization of `region_info.shared_region->procs[i].hostpid` is incorrect. Where is the code that sets this variable? Does anyone know about this issue?