Closed: eabochasjauregui closed this issue 3 years ago
Hi @eabochasjauregui.
Before digging into it too much, how have you deployed the GFD? Via a static daemonset or via helm?
The example at https://github.com/NVIDIA/gpu-feature-discovery/blob/master/deployments/static/gpu-feature-discovery-daemonset-with-mig-mixed.yaml#L31 shows what you need in order to deploy it for use with the MIG mixed strategy (there is also an example there for the single strategy).
The error you are seeing would likely occur if you didn't set the NVIDIA_MIG_MONITOR_DEVICES environment variable on your container. Without this, the container does not have privileges to read the state of the MIG devices across all GPUs.
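For reference, the relevant part of that example sets something like the following on the GFD container (excerpted and paraphrased here; values shown are illustrative for the mixed strategy):

# Excerpt (paraphrased) of the GFD container spec for MIG support.
# NVIDIA_MIG_MONITOR_DEVICES=all, together with CAP_SYS_ADMIN, is what lets the
# container read MIG state across all GPUs.
env:
  - name: GFD_MIG_STRATEGY
    value: mixed              # or "single"
  - name: NVIDIA_MIG_MONITOR_DEVICES
    value: all
securityContext:
  capabilities:
    add:
      - SYS_ADMIN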
Hi @klueska,
Thank you for your reply. We deployed the GFD via Helm. We followed the documentation over at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#using-mig-strategies-in-kubernetes, so I don't believe we set the NVIDIA_MIG_MONITOR_DEVICES environment variable. Would that be set as a value for the GFD chart?
Deploying via helm should set this for you if you set your migStrategy to anything other than none:
https://github.com/NVIDIA/gpu-feature-discovery/blob/master/deployments/helm/gpu-feature-discovery/templates/daemonset.yml#L57
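(Paraphrasing the logic on that template line rather than quoting the chart source exactly: the env var is only rendered when migStrategy is not none.)

# Rough sketch of the templated env entry, not the exact chart source:
{{- if ne .Values.migStrategy "none" }}
- name: NVIDIA_MIG_MONITOR_DEVICES
  value: all
{{- end }}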
Can you show me the output of:
helm get all gpu-feature-discovery-<release_id>
Sure, here's the output:
NAME: gpu-feature-discovery-1608671054
LAST DEPLOYED: Tue Dec 22 15:04:14 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
migStrategy: mixed
COMPUTED VALUES:
affinity: {}
fullnameOverride: ""
image:
pullPolicy: IfNotPresent
repository: nvidia/gpu-feature-discovery
tag: ""
imagePullSecrets: []
migStrategy: mixed
nameOverride: ""
namespace: node-feature-discovery
nfd:
deploy: true
node-feature-discovery:
fullnameOverride: nfd
global: {}
image:
pullPolicy: IfNotPresent
repository: quay.io/kubernetes_incubator/node-feature-discovery
tag: ""
imagePullSecrets: []
master:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: In
values:
- ""
weight: 1
annotations: {}
extraLabelNs:
- nvidia.com
nodeSelector: {}
podSecurityContext: {}
resources: {}
securityContext: {}
service:
port: 8080
type: ClusterIP
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Equal
value: ""
nameOverride: ""
namespace:
create: true
name: node-feature-discovery
rbac:
create: true
role: ""
serviceAccount:
annotations: {}
create: true
name: ""
worker:
affinity: {}
annotations: {}
nodeSelector: {}
options:
sources:
pci:
deviceLabelFields:
- vendor
podSecurityContext: {}
resources: {}
securityContext: {}
tolerations: {}
nodeSelector:
feature.node.kubernetes.io/pci-10de.present: "true"
podSecurityContext: {}
resources: {}
securityContext: {}
sleepInterval: 60s
tolerations: {}
HOOKS:
MANIFEST:
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: node-feature-discovery # NFD namespace
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: nfd-master
namespace: node-feature-discovery
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: nfd-master
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
- ""
resources:
- nodes
# when using command line flag --resource-labels to create extended resources
# you will need to uncomment "- nodes/status"
# - nodes/status
verbs:
- get
- patch
- update
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: nfd-master
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: nfd-master
subjects:
- kind: ServiceAccount
name: nfd-master
namespace: node-feature-discovery
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
name: nfd-master
namespace: node-feature-discovery
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
spec:
selector:
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/role: master
ports:
- name: grpc
targetPort: grpc
protocol: TCP
port: 8080
type: ClusterIP
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/worker.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nfd-worker
namespace: node-feature-discovery
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/role: worker
spec:
selector:
matchLabels:
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/role: worker
template:
metadata:
labels:
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/role: worker
annotations:
{}
spec:
dnsPolicy: ClusterFirstWithHostNet
securityContext:
{}
containers:
- name: worker
image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
imagePullPolicy: IfNotPresent
securityContext:
{}
resources:
{}
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
command:
- "nfd-worker"
args:
- "--sleep-interval=60s"
- --server=nfd-master:8080
- --options={"sources":{"pci":{"deviceLabelFields":["vendor"]}}}
## Enable TLS authentication (1/3)
## The example below assumes having the root certificate named ca.crt stored in
## a ConfigMap named nfd-ca-cert, and, the TLS authentication credentials stored
## in a TLS Secret named nfd-worker-cert
# - "--ca-file=/etc/kubernetes/node-feature-discovery/trust/ca.crt"
# - "--key-file=/etc/kubernetes/node-feature-discovery/certs/tls.key"
# - "--cert-file=/etc/kubernetes/node-feature-discovery/certs/tls.crt"
volumeMounts:
- name: host-boot
mountPath: "/host-boot"
readOnly: true
- name: host-os-release
mountPath: "/host-etc/os-release"
readOnly: true
- name: host-sys
mountPath: "/host-sys"
- name: source-d
mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
- name: features-d
mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
## Enable TLS authentication (2/3)
# - name: nfd-ca-cert
# mountPath: "/etc/kubernetes/node-feature-discovery/trust"
# readOnly: true
# - name: nfd-worker-cert
# mountPath: "/etc/kubernetes/node-feature-discovery/certs"
# readOnly: true
volumes:
- name: host-boot
hostPath:
path: "/boot"
- name: host-os-release
hostPath:
path: "/etc/os-release"
- name: host-sys
hostPath:
path: "/sys"
- name: source-d
hostPath:
path: "/etc/kubernetes/node-feature-discovery/source.d/"
- name: features-d
hostPath:
path: "/etc/kubernetes/node-feature-discovery/features.d/"
## Enable TLS authentication (3/3)
# - name: nfd-ca-cert
# configMap:
# name: nfd-ca-cert
# - name: nfd-worker-cert
# secret:
# secretName: nfd-worker-cert
---
# Source: gpu-feature-discovery/templates/daemonset.yml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-feature-discovery-1608671054
namespace: node-feature-discovery
labels:
helm.sh/chart: gpu-feature-discovery-0.2.2
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.2.2"
app.kubernetes.io/managed-by: Helm
spec:
selector:
matchLabels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app.kubernetes.io/name: gpu-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
spec:
# Mark this pod as a critical add-on; when enabled, the critical add-on
# scheduler reserves resources for critical add-on pods so that they can
# be rescheduled after a failure.
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
securityContext:
{}
containers:
- image: nvidia/gpu-feature-discovery:v0.2.2
imagePullPolicy: IfNotPresent
name: gpu-feature-discovery
env:
- name: GFD_SLEEP_INTERVAL
value: 60s
- name: GFD_MIG_STRATEGY
value: mixed
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
securityContext:
capabilities:
add:
- SYS_ADMIN
volumeMounts:
- name: output-dir
mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
- name: dmi-product-name
mountPath: "/sys/class/dmi/id/product_name"
volumes:
- name: output-dir
hostPath:
path: "/etc/kubernetes/node-feature-discovery/features.d"
- name: dmi-product-name
hostPath:
path: "/sys/class/dmi/id/product_name"
nodeSelector:
feature.node.kubernetes.io/pci-10de.present: "true"
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nfd-master
namespace: node-feature-discovery
labels:
helm.sh/chart: node-feature-discovery-0.1.0
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/version: "0.6.0"
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/role: master
spec:
replicas:
selector:
matchLabels:
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/role: master
template:
metadata:
labels:
app.kubernetes.io/name: node-feature-discovery
app.kubernetes.io/instance: gpu-feature-discovery-1608671054
app.kubernetes.io/role: master
annotations:
{}
spec:
serviceAccountName: nfd-master
securityContext:
{}
containers:
- name: master
image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
imagePullPolicy: IfNotPresent
securityContext:
{}
resources:
{}
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
ports:
- name: grpc
containerPort: 8080
command:
- "nfd-master"
args:
- --extra-label-ns=nvidia.com
## Enable TLS authentication
## The example below assumes having the root certificate named ca.crt stored in
## a ConfigMap named nfd-ca-cert, and, the TLS authentication credentials stored
## in a TLS Secret named nfd-master-cert.
## Additional hardening can be enabled by specifying --verify-node-name in
## args, in which case every nfd-worker requires a individual node-specific
## TLS certificate.
# - "--ca-file=/etc/kubernetes/node-feature-discovery/trust/ca.crt"
# - "--key-file=/etc/kubernetes/node-feature-discovery/certs/tls.key"
# - "--cert-file=/etc/kubernetes/node-feature-discovery/certs/tls.crt"
# volumeMounts:
# - name: nfd-ca-cert
# mountPath: "/etc/kubernetes/node-feature-discovery/trust"
# readOnly: true
# - name: nfd-master-cert
# mountPath: "/etc/kubernetes/node-feature-discovery/certs"
# readOnly: true
# volumes:
# - name: nfd-ca-cert
# configMap:
# name: nfd-ca-cert
# - name: nfd-master-cert
# secret:
# secretName: nfd-master-cert
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- preference:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: In
values:
- ""
weight: 1
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/master
operator: Equal
value: ""
Hmm. Everything looks good there.
What version of the NVIDIA container stack do you have installed? I.e., if you are on an Ubuntu machine, can you show me the output of:
sudo dpkg -l '*nvidia-container*'
The drivers, stack, and cluster were installed via the DeepOps playbooks. The nodes are RHEL, so we ran sudo rpm -qa '*nvidia-container*' instead; the output is as follows:
nvidia-container-runtime-3.2.0-1.x86_64
nvidia-container-selinux-20.09-0.el7.noarch
libnvidia-container1-1.2.0-1.x86_64
libnvidia-container-tools-1.2.0-1.x86_64
nvidia-container-toolkit-1.2.1-2.x86_64
Yeah, I think you probably need to update your libnvidia-container to v1.3.0.
Without going into too much detail, v1.3.0 is the first version that supports what's known as /dev-based nvidia-capabilities (which the newest drivers have turned on by default, and which MIG uses under the hood for almost everything). Versions v1.1.0 and v1.2.0 only worked with what's called /proc-based nvidia-capabilities, which I'm assuming your driver doesn't have enabled.
More details here: https://docs.google.com/document/d/194A-Hg3mLlIW4eo2BSUcGKpzZf2ciX47eoH5T-WZNXo/edit?ts=5fd54b8c
Alright, that was indeed the issue! We updated libnvidia-container1 and libnvidia-container-tools to the latest version, v1.3.1, and upon restarting the kubelet, the device plugin, and the GFD, everything worked as expected.
Thank you for your help!
We have been working on setting up the device plugin and GFD with MIG. We managed to configure the MIG profiles, and they can now be seen by nvidia-smi. In Kubernetes, the node also changed to show only the GPUs we didn't split; however, it still did not show the resources corresponding to the MIG devices that were configured. Upon closer inspection, we noticed that the GFD logs were showing an NVML permission error that didn't seem to appear when MIG wasn't enabled, as shown in the following screenshot.
Any help with this issue would be greatly appreciated.
Thank you!