Open xiaoxiaoboyyds opened 4 months ago
My Deployment:
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations: {}
  labels:
    k8s.kuboard.cn/layer: svc
    k8s.kuboard.cn/name: video
  name: video
  namespace: edge
  resourceVersion: '1597692'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 1
  selector:
    matchLabels:
      k8s.kuboard.cn/layer: svc
      k8s.kuboard.cn/name: video
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: '2024-07-08T17:37:11+08:00'
      creationTimestamp: null
      labels:
        k8s.kuboard.cn/layer: svc
        k8s.kuboard.cn/name: video
    spec:
      containers:
        - image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          imagePullPolicy: IfNotPresent
          name: video
          resources:
            limits:
              nvidia.com/gpu: '1'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /dev/shm
              name: shm
      dnsPolicy: ClusterFirst
      nodeName: edgenode-test
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - hostPath:
            path: /dev/shm
            type: Directory
          name: shm
Meet a similar issue and find a comment at https://github.com/NVIDIA/k8s-device-plugin/issues/467#issuecomment-1974252052 Remove /dev/shm from your deployment and try again. Could you report back if this works or not?
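For readers following along, a minimal sketch of what that suggestion amounts to (only fields copied from the deployment above are shown; everything else stays as posted):

# Sketch: drop the volumeMount and the hostPath volume that both target /dev/shm,
# so nothing in the pod spec shadows whatever the device plugin mounts there.
    spec:
      containers:
        - name: video
          image: 'harbor.moolink.net/moolink/video-supervision:v1.2'
          resources:
            limits:
              nvidia.com/gpu: '1'
          # volumeMounts entry for /dev/shm removed
      # volumes entry for the /dev/shm hostPath removed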
~Running with MPS requires you set hostPID: true in the Pod spec field, which I don't see you've done. I suspect this would resolve the issue.~ Correction: This is only required in GKE.
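For anyone who is on GKE and does need it, hostPID is a single pod-spec-level field; a minimal sketch against the deployment above:

# Sketch: hostPID sits at the pod template's spec level, next to containers.
spec:
  template:
    spec:
      hostPID: true   # per the correction above, only required for MPS on GKE
      containers:
        - name: video
          image: 'harbor.moolink.net/moolink/video-supervision:v1.2'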
Why, after running
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set-file config.map.config=config.yaml
is there no nvidia-device-plugin-mps-control-daemon pod?
kubectl get pod -n nvidia-device-plugin
NAME READY STATUS RESTARTS AGE
nvdp-nvidia-device-plugin-fmhlh 2/2 Running 0 21m
and the node reports nvidia.com/gpu: 0:
Capacity:
cpu: 16
ephemeral-storage: 309506092Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131500512Ki
nvidia.com/gpu: 0
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 285240813915
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 131193312Ki
nvidia.com/gpu: 0
pods: 110
Using nvidia-device-plugin 0.16.1.
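In case it is the missing piece: the mps-control-daemon is only deployed when the config passed via --set-file actually enables MPS sharing. A sketch of what such a config.yaml is expected to look like (the replica count is an illustrative value, not a recommendation):

# Example config.yaml enabling MPS sharing in the device plugin
# (replicas value is illustrative; choose your own partitioning).
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 10

As far as I understand, the control daemon's DaemonSet also selects nodes labelled nvidia.com/mps.capable=true, and nvidia.com/gpu: 0 usually means the plugin never registered the resource at all, so the plugin container's logs on that node are worth checking.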
Hi, any update on making the MPS shm size configurable in recent releases? We cannot use it like this; each of our workloads needs a different shm size.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
1. Quick Debug Information
2. Issue or feature description
After successfully deploying the nvidia-device-plugin using the command, I was also able to get the GPU advertised as 10 schedulable replicas (nvidia.com/gpu: 10). However, with MPS mode, YOLO inside the container cannot access the GPU when it requests resources, while timeSlicing works fine. Do I need to enable anything else? I have already started "nvidia-cuda-mps-control -d", and "nvidia-smi" inside the container can also see the GPU resources.
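When debugging a case like this, it may help to confirm from inside the workload container that the plugin-managed MPS plumbing is visible and not shadowed by another mount; a rough sketch (the exact variable and mount names are assumptions and depend on the plugin version):

# Rough sanity checks inside the workload container (names are assumptions):
env | grep -i mps          # is a CUDA MPS pipe/log directory being injected?
mount | grep -Ei 'shm|mps' # is another mount shadowing the plugin's /dev/shm?
nvidia-smi                 # device visibility (already confirmed above)

Note that in MPS mode the plugin ships its own mps-control-daemon, so a manually started nvidia-cuda-mps-control on the host may not be the instance the injected configuration points at.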
3. Information to attach (optional if deemed irrelevant)