What happened:
When a single pod with two or more containers requests GPUs, the device score is wrong: the used resources are calculated against the wrong device.
Assume we have 2 GPU cards, each with 10Gi of memory, and no tasks are running yet. One pod has two (or more) containers, and each container requests GPU resources (memory, in this example; a minimal pod sketch follows the results below):
- container1 requests 10Gi of memory
- container2 requests 1Gi of memory
We want this result:
- container1 gets card1
- container2 gets card2

The wrong result we get now:
- container1 gets card1
- container2 gets card1
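For reference, a two-container pod like the one above can be expressed as below. This is only a sketch: the image, the `nvidia.com/gpu` / `nvidia.com/gpumem` resource keys, and the memory quantities are placeholders and should be replaced with whatever resource names and units your HAMi deployment actually exposes.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gpuContainer builds a container that asks for one GPU card plus a
// given amount of GPU memory. Resource keys and units are assumptions,
// not the definitive HAMi configuration.
func gpuContainer(name, gpuMem string) corev1.Container {
	return corev1.Container{
		Name:    name,
		Image:   "nvidia/cuda:12.2.0-base-ubuntu22.04",
		Command: []string{"sleep", "infinity"},
		Resources: corev1.ResourceRequirements{
			Limits: corev1.ResourceList{
				corev1.ResourceName("nvidia.com/gpu"):    resource.MustParse("1"),
				corev1.ResourceName("nvidia.com/gpumem"): resource.MustParse(gpuMem),
			},
		},
	}
}

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "two-container-gpu-pod"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				gpuContainer("container1", "10Gi"), // should land on card1
				gpuContainer("container2", "1Gi"),  // should land on card2
			},
		},
	}
	fmt.Printf("%+v\n", pod)
}
```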
Root cause:
There is a sort in the `fitInDevices` function that changes the order of `node.Devices`, but the result returned by `fitInCertainDevice` records the device index from before the sort. So after `fitInCertainDevice` returns, `used`, `usedcores`, and `usedmem` are calculated and added to the wrong device.
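Below is a minimal, self-contained sketch of this class of indexing mismatch (not the actual HAMi code; `Device`, `fitOne`, and the sort key are simplified stand-ins): the device is selected against one ordering, but the usage is committed by positional index against another ordering, so the accounting lands on a card that was never picked.

```go
package main

import (
	"fmt"
	"sort"
)

// Device is a simplified stand-in for a scheduler's per-card bookkeeping.
type Device struct {
	ID       string
	Usedmem  int64 // Gi already committed on this card
	Totalmem int64 // Gi total on this card
}

// fitOne mimics a fitInCertainDevice-style helper: it walks the slice it
// is given and returns the positional index of the first card that fits.
func fitOne(devs []Device, reqMem int64) (int, bool) {
	for i, d := range devs {
		if d.Totalmem-d.Usedmem >= reqMem {
			return i, true
		}
	}
	return -1, false
}

func main() {
	original := []Device{
		{ID: "card1", Totalmem: 10},
		{ID: "card2", Totalmem: 10},
	}

	// container1 wants 10Gi, container2 wants 1Gi.
	for _, req := range []int64{10, 1} {
		// The selection happens against a sorted view of the devices...
		sorted := append([]Device(nil), original...)
		sort.Slice(sorted, func(i, j int) bool {
			if sorted[i].Usedmem != sorted[j].Usedmem {
				return sorted[i].Usedmem < sorted[j].Usedmem // least-used first
			}
			return sorted[i].ID < sorted[j].ID // deterministic tie-break
		})

		idx, ok := fitOne(sorted, req)
		if !ok {
			fmt.Printf("request %dGi: no device fits\n", req)
			continue
		}

		// ...but the usage is committed by index into the original ordering.
		// For the second request this charges card1 even though card2 was
		// selected, leaving card1 "using" 11Gi of its 10Gi capacity.
		original[idx].Usedmem += req
		fmt.Printf("request %dGi: selected %s, but usage charged to %s\n",
			req, sorted[idx].ID, original[idx].ID)
	}
	fmt.Printf("final accounting: %+v\n", original)
}
```

One way to avoid this class of bug is to resolve the selection through a stable device identifier (for example the card UUID) or to commit usage against the same ordering the index was recorded for, rather than through a positional index that a later sort can invalidate.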
What you expected to happen:
Each container should be assigned the correct device when a single pod has two or more containers.
How to reproduce it (as minimally and precisely as possible):
See the scenario described above.
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g. `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)
- The output of `dmesg` on the node
Environment:
- `docker version`
- `uname -a`