The metagpu device plugin (mgdp) allows you to share one or more Nvidia GPUs between different K8s workloads.
K8s does not provide native support for GPU sharing, so a user must allocate an entire GPU to a workload even when the actual GPU usage is far below 100%. This project aims to improve GPU utilization by letting multiple K8s workloads share a GPU.
The mgdp is based on the Nvidia Container Runtime and on go-nvml.
One of the features the Nvidia container runtime provides is the ability to specify the visible GPU device IDs through the NVIDIA_VISIBLE_DEVICES environment variable.
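As a rough illustration of how that variable can be read, here is a minimal sketch that parses NVIDIA_VISIBLE_DEVICES into a list of device IDs. The helper name `visible_gpu_devices` is hypothetical and is not part of mgdp. The special values `all`, `none`, and `void` follow the NVIDIA container runtime conventions.

```python
import os


def visible_gpu_devices(environ=None):
    """Return the GPU device IDs listed in NVIDIA_VISIBLE_DEVICES.

    Returns the string "all" when every GPU is exposed, an empty list when
    no GPU is exposed, and otherwise the comma-separated IDs (indexes or
    UUIDs) as a list of strings.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("NVIDIA_VISIBLE_DEVICES", "").strip()
    if value == "all":
        return "all"
    if value in ("", "none", "void"):
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


# Example with an explicit mapping instead of the real process environment
print(visible_gpu_devices({"NVIDIA_VISIBLE_DEVICES": "0, 1"}))
```

Passing a plain dict, as in the example, makes it possible to try the helper on a machine without any GPU.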
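The shortest explanation of the mgdp logic is:
1. mgdp detects all the GPU device IDs.
2. From the real GPU device IDs, mgdp generates meta-device IDs.
3. mgdp advertises these meta-device IDs to K8s.
4. When a workload requests a fraction of a GPU, for example 0.5 GPU, mgdp will allocate 50 meta-device IDs.
In addition, each metagpu container will have mgctl.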
The mgctl is an alternative to nvidia-smi. It improves security and provides better K8s integration.
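The meta-device logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mgdp implementation: the function names and the `<real-id>-meta-<n>` ID format are hypothetical, and it assumes that the real device behind an allocation ends up in NVIDIA_VISIBLE_DEVICES.

```python
from collections import Counter

SHARES_PER_GPU = 100  # matches the default split described in this README


def generate_meta_device_ids(real_ids, shares=SHARES_PER_GPU):
    """Return a dict mapping each meta-device ID to its real GPU ID."""
    return {
        f"{real_id}-meta-{index}": real_id  # hypothetical ID format
        for real_id in real_ids
        for index in range(shares)
    }


def real_devices_for(allocated_meta_ids, meta_to_real):
    """Count how many of the allocated meta-devices sit on each real GPU."""
    return dict(Counter(meta_to_real[meta_id] for meta_id in allocated_meta_ids))


def visible_devices_value(allocated_meta_ids, meta_to_real):
    """Build a NVIDIA_VISIBLE_DEVICES value from the backing real GPUs (assumed behavior)."""
    return ",".join(sorted(real_devices_for(allocated_meta_ids, meta_to_real)))


meta_to_real = generate_meta_device_ids(["GPU-a", "GPU-b"])
half_gpu = [m for m, real in meta_to_real.items() if real == "GPU-a"][:50]
print(len(meta_to_real))                              # 200
print(real_devices_for(half_gpu, meta_to_real))       # {'GPU-a': 50}
print(visible_devices_value(half_gpu, meta_to_real))  # GPU-a
```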
By default, mgdp splits each of your GPU devices into 100 metagpus. For example, on a machine with 2 GPUs, mgdp will generate 200 metagpus. Requesting 50 metagpus will give you 0.5 GPU; requesting 150 metagpus will give you 1.5 GPUs.
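The arithmetic can be checked with a tiny helper. The function names are hypothetical and the constant simply encodes the default of 100 metagpus per GPU stated above.

```python
METAGPUS_PER_GPU = 100  # default split per physical GPU


def total_metagpus(gpu_count, per_gpu=METAGPUS_PER_GPU):
    """Number of metagpus advertised for a machine with gpu_count GPUs."""
    return gpu_count * per_gpu


def gpu_share(metagpus, per_gpu=METAGPUS_PER_GPU):
    """Fraction of physical GPUs that a metagpu request corresponds to."""
    return metagpus / per_gpu


print(total_metagpus(2))  # 200
print(gpu_share(50))      # 0.5
print(gpu_share(150))     # 1.5
```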
# cd into cloned directory and run
# for openshift set ocp=true
helm install chart --generate-name --set ocp=false -n cnvrg
# alternatively, render the manifests with helm template and apply them with kubectl
# cd into cloned directory and run
# for openshift set ocp=true
helm template chart --set ocp=false -n cnvrg > metagpu.yaml
kubectl apply -f metagpu.yaml
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: metagpu-test
  namespace: cnvrg
spec:
  tolerations:
    - operator: "Exists"
  containers:
    - name: gpu-test-with-gpu
      image: tensorflow/tensorflow:latest-gpu
      command:
        - /usr/local/bin/python
        - -c
        - |
          import tensorflow as tf
          gpus = tf.config.list_physical_devices('GPU')
          if gpus:
            # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
            try:
              tf.config.set_logical_device_configuration(
                gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
              logical_gpus = tf.config.list_logical_devices('GPU')
              print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
            except RuntimeError as e:
              # Virtual devices must be set before GPUs have been initialized
              print(e)
          print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
          # Keep the GPU busy so the usage can be observed
          while True:
            print(tf.reduce_sum(tf.random.normal([1000, 1000])))
      resources:
        limits:
          # resource name assumed here; use the name advertised by mgdp on your nodes
          cnvrg.io/metagpu: "30"
EOF