FrsECM opened this issue 3 months ago
Which driver version are you using? Does the log of the `mps-control-daemon-ctr` container show any additional output?
Also, to clarify: is the device plugin deployed using the GPU Operator or using the standalone Helm chart?
I'm using version 550 of the driver. I don't have an `mps-control-daemon-ctr`; maybe the problem is there! Do you have a template to install it without Helm?
In the beginning I was using the plugin deployed with the GPU Operator (v23.9.2), but I manually overrode the YAML to target k8s-device-plugin v0.15.0 instead of v0.14.
I installed the control daemon as an "extra", and it's now up and running. I used this template:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin-mps-control-daemon
  namespace: gpu-operator
  labels:
    helm.sh/chart: nvidia-device-plugin-0.15.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/version: "0.15.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: nvdp
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: nvdp
      annotations: {}
    spec:
      priorityClassName: system-node-critical
      securityContext: {}
      initContainers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        name: mps-control-daemon-mounts
        command: [mps-control-daemon, mount-shm]
        securityContext:
          privileged: true
        volumeMounts:
        - name: mps-root
          mountPath: /mps
          mountPropagation: Bidirectional
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
        imagePullPolicy: IfNotPresent
        name: mps-control-daemon-ctr
        command: [mps-control-daemon]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: all
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
        securityContext:
          privileged: true
        volumeMounts:
        - name: mps-shm
          mountPath: /dev/shm
        - name: mps-root
          mountPath: /mps
      volumes:
      - name: mps-root
        hostPath:
          path: /run/nvidia/mps
          type: DirectoryOrCreate
      - name: mps-shm
        hostPath:
          path: /run/nvidia/mps/shm
      nodeSelector:
        # We only deploy this pod if the following sharing label is applied.
        nvidia.com/mps.capable: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: feature.node.kubernetes.io/pci-10de.present
                operator: In
                values:
                - "true"
            - matchExpressions:
              - key: feature.node.kubernetes.io/cpu-model.vendor_id
                operator: In
                values:
                - NVIDIA
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```
I also modified the node labels to indicate that MPS is enabled:
```shell
kubectl label node mitcv01 nvidia.com/mps.capable="true" --overwrite
```
The daemon starts, but it says that a "strategy" is missing:
How can I update this strategy and set up mps-control to use the same ConfigMap as the device plugin?
```shell
kubectl patch clusterpolicy/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "nvidia-sharing-config"}}}}'
```
You need to supply the same config map / name as for the device plugin. There is also a sidecar that ensures the config is up to date in the same way that the device plugin / gfd does.
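For reference, a shared ConfigMap in the shape the device plugin expects might look like the following sketch. The entry names (`time-slicing`, `mps`) and the replica counts are assumptions for illustration, not values taken from this thread:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-sharing-config
  namespace: gpu-operator
data:
  # One named entry per sharing strategy; a node can be pointed at
  # a specific entry, or a default entry can be configured.
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
  mps: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```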
Is there a reason that you don't skip the installation of the device plugin in the operator and deploy that using helm? See for example: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit#heading=h.9odbb6smrel8
Great document, thanks a lot !
It was due to a lack of knowledge about how to pass the configuration to the plugin. It seems to work now thanks to your very helpful document!
Finally I did:
```shell
helm install --dry-run gpu-operator --wait -n gpu-operator --create-namespace \
  nvidia/gpu-operator --version v23.9.2 \
  --set nfd.enabled=false \
  --set devicePlugin.enabled=false \
  --set gfd.enabled=false \
  --set toolkit.enabled=false > nvidia-gpu-operator.yaml
```
Then, to install MPS:
```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --version=0.15.0 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set config.default=nvidia-sharing \
  --set-file config.map.nvidia-sharing=config/nvidia/config/dp-mps-6.yaml
```
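The contents of `config/nvidia/config/dp-mps-6.yaml` aren't shown in the thread; a minimal MPS sharing config in the v0.15.0 format might look like this sketch (the replica count of 6 is a guess based on the file name):

```yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 6
```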
Thanks again for your help.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
1. Quick Debug Information
2. Issue or feature description
I created a "sharing" ConfigMap, including MPS and time-slicing config, in order to switch from one to the other:
There is absolutely no issue with time slicing:
But if I want to use MPS, I get this issue:
Can you help me figure out what I did wrong? Thanks,
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of `nvidia-smi -a` on your host
- The kubelet logs on the node (e.g. `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
- [x] Docker version from `docker version`
- [x] NVIDIA packages version from `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- NVIDIA container library version from `nvidia-container-cli -V`