NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

When I want to use MPS in Kubernetes, I need to specify --mps-root. #816

Open zbk2012 opened 3 months ago

zbk2012 commented 3 months ago

Logs:

using mps requires --mps-root to be specified.

The contents of the nvidia-device-plugin.yml file are as follows:

...
env:
- name: CONFIG_FILE
  value: "/data/system-yaml/a100-mps.yaml"
...

The contents of the /data/system-yaml/a100-mps.yaml file are as follows:

version: v1
sharing:
mps:
resources:
- name: nvidia.com/gpu
replicas: 2

I have added the following content to the nvidia-device-plugin.yml file:

...
env:
- name: CONFIG_FILE
  value: "/data/system-yaml/a100-mps.yaml"
- name: MPS_ROOT
  value: "/run/nvidia/mps"
...

The container started successfully, but no GPU was found and the /run/nvidia/mps directory is empty.

How should MPS_ROOT be set?

elezar commented 3 months ago

Hi @zbk2012. From your example, it seems your config file is not properly indented. You are probably looking for something like this instead:

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2

This should also be confirmed by your device plugin logs.
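
One way to rule out an indentation problem before digging through the logs is a quick local check of the file. The sketch below is a hypothetical helper (check_mps_indent is not part of the device plugin) and assumes the file is indented with spaces only. It just compares leading-space counts of the mps: and resources: keys; it is not a YAML parser.

```shell
#!/bin/sh
# Hypothetical helper, not part of k8s-device-plugin.
# Prints "nested" (exit 0) when "mps:" is indented and "resources:" is
# indented further than "mps:"; prints "flat" (exit 1) otherwise,
# including when either key is missing.
check_mps_indent() {
    awk '
        BEGIN { mps = -1; res = -1 }
        # Remember the indentation (leading spaces) of the first matching key.
        /^ *mps:/       && mps < 0 { match($0, /[^ ]/); mps = RSTART - 1 }
        /^ *resources:/ && res < 0 { match($0, /[^ ]/); res = RSTART - 1 }
        END {
            if (mps > 0 && res > mps) { print "nested"; exit 0 }
            print "flat"
            exit 1
        }
    ' "$1"
}

# Example call, using the path from the original report:
#   check_mps_indent /data/system-yaml/a100-mps.yaml
```

A "flat" result would suggest the sharing section lost its nesting, as in the snippet pasted in the first comment.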

zbk2012 commented 3 months ago

Oh, I'm sorry, the indentation was missing when copying. The indentation in the config file is correct.

elezar commented 2 months ago

@zbk2012 could you provide the logs for GFD (GPU Feature Discovery) and the device plugin? For example, I use the following to deploy the plugin:

helm upgrade nvidia -i deployments/helm/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set runtimeClassName=nvidia \
    --set config.name=nvidia-plugin-configs \
    --set nvidiaDriverRoot=/ \
    --set gfd.enabled=true

Where the config is created from:

cat << EOF > dp-mps-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
    - envvar
    deviceIDStrategy: uuid
sharing:
  mps:
    renameByDefault: false
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF

by running:

kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \
    --from-file=config=dp-mps-config.yaml
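
To gather the logs asked for above once the plugin is deployed this way, something like the following sketch may help. It assumes the nvidia-device-plugin namespace from the helm command and that the GFD pods run in that same namespace. collect_plugin_logs and show_gpu_allocatable are hypothetical helper names, not part of the plugin or of kubectl.

```shell
#!/bin/sh
# Sketch only. NAMESPACE defaults to the --namespace used in the helm command.
NAMESPACE="${NAMESPACE:-nvidia-device-plugin}"

# Print recent logs for every pod in the namespace. Assumption: the device
# plugin and GFD pods are both deployed there by the same helm release.
collect_plugin_logs() {
    for pod in $(kubectl get pods -n "$NAMESPACE" -o name); do
        echo "=== $pod ==="
        kubectl logs -n "$NAMESPACE" "$pod" --all-containers --tail=200
    done
}

# List how many nvidia.com/gpu resources each node currently advertises.
show_gpu_allocatable() {
    kubectl get nodes \
        -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Example calls:
#   collect_plugin_logs > plugin-logs.txt
#   show_gpu_allocatable
```

If the MPS config was applied, each node would be expected to advertise replicas times its physical GPU count as nvidia.com/gpu (four per GPU with the config above, since renameByDefault is false). A count equal to the physical GPU count, or no nvidia.com/gpu at all, would point to the sharing config not being picked up.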