Closed t-ibayashi-safie closed 1 year ago
pluginapi.AllocateRequest does not contain information about the current pod/container, so I think it is not trivial to add the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT env variable to this plugin.
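To illustrate the point, here is a minimal Go sketch of what an Allocate call gets to see. The two structs are local stand-ins that mirror the shape of the kubelet device plugin v1beta1 request as I understand it (a list of per-container requests, each carrying only device IDs); they are not imported from the real package, and `describe` is a hypothetical helper.

```go
package main

import "fmt"

// Local stand-ins mirroring the shape of the device plugin v1beta1
// request messages. They are NOT the real pluginapi types.
type ContainerAllocateRequest struct {
	DevicesIDs []string // IDs of the devices the kubelet picked
}

type AllocateRequest struct {
	ContainerRequests []*ContainerAllocateRequest
}

// describe (hypothetical helper) lists everything a plugin can learn
// from one request: an index and device IDs, but no pod or container name.
func describe(req *AllocateRequest) []string {
	var out []string
	for i, c := range req.ContainerRequests {
		out = append(out, fmt.Sprintf("container #%d: %d device(s) %v",
			i, len(c.DevicesIDs), c.DevicesIDs))
	}
	return out
}

func main() {
	req := &AllocateRequest{ContainerRequests: []*ContainerAllocateRequest{
		{DevicesIDs: []string{"vgpu-0"}},
	}}
	for _, line := range describe(req) {
		fmt.Println(line)
	}
}
```

Since nothing in the request identifies the pod, the plugin cannot look up a per-container memory setting from it directly.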
Limitations:
I am working on a fork to support it though (hopefully)
Currently I have this:
- CUDA_MPS_ACTIVE_THREAD_PERCENTAGE on client level so each container can have different amount of SM units.
- CUDA_MPS_PINNED_DEVICE_MEM_LIMIT as env variable to container to limit GPU memory usage.

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-device-query
spec:
  hostIPC: true
  containers:
    - name: nvidia-device-query
      image: ghcr.io/kuartis/nvidia-device-query:1.0.0
      command: ["/bin/sh", "-ec", "while :; do echo '.'; sleep 5 ; done"]
      env:
        - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
          value: 0=2G
      resources:
        limits:
          k8s.kuartis.com/vgpu: '1'
      volumeMounts:
        - name: nvidia-mps
          mountPath: /tmp/nvidia-mps
  volumes:
    - name: nvidia-mps
      hostPath:
        path: /tmp/nvidia-mps
What I plan is to create a new resource definition inside the same plugin and make both Allocate methods talk to each other via channels.
resources:
  limits:
    k8s.kuartis.com/vgpu: '1'
    k8s.kuartis.com/vgpu-mem: '1024' # This will set correct env variable for container
Here is the link: https://github.com/kuartis/kuartis-virtual-gpu-device-plugin
Thank you for providing an answer to my question.
If the CUDA version in each pod is 11.5 or higher, your repository can limit the memory without relying on TensorFlow, right?
> What I plan is to create a new resource definition inside same plugin and make both Allocate methods to talk each other via channels.
Amazing. I'm looking forward to using this :)
Yes, it does limit the memory usage of the container. The container even OOMs if you set the limit too low.
Present Status
I understand the current system configuration as follows:
My Suggestion