NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

Issue with GFD and MIGs. #6

Closed: eabochasjauregui closed this issue 3 years ago

eabochasjauregui commented 3 years ago

We have been working on setting up the device plugin and GFD with MIG. We managed to configure the MIG profiles, and they are now visible in nvidia-smi. In Kubernetes, the node now advertises only the GPUs we didn't partition; however, it still does not expose the resources corresponding to the MIG devices we configured. On closer inspection, we noticed the GFD logs show an NVML permission error that does not appear when MIG is disabled, as shown in the following screenshot.
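
For context, here is a minimal sketch of how MIG profiles like ours can be created with nvidia-smi; the GPU index and profile IDs are illustrative, not our actual layout.

sudo nvidia-smi -i 0 -mig 1             # enable MIG mode on GPU 0 (illustrative index)
sudo nvidia-smi mig -i 0 -cgi 19,19 -C  # create two GPU instances plus their compute instances (example profile IDs)
nvidia-smi -L                           # the MIG devices now appear in the device listing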

[screenshot: GFD logs showing the NVML permission error]

Any help with this issue would be greatly appreciated.

Thank you!

klueska commented 3 years ago

Hi @eabochasjauregui.

Before digging into it too much, how have you deployed GFD? Via a static daemonset or via helm?

The example at https://github.com/NVIDIA/gpu-feature-discovery/blob/master/deployments/static/gpu-feature-discovery-daemonset-with-mig-mixed.yaml#L31 shows what you need in order to deploy it for use with the MIG mixed strategy (there is also an example there for the single strategy).

The error you are seeing would likely occur if you didn't set the NVIDIA_MIG_MONITOR_DEVICES environment variable on your container. Without this, the container does not have privileges to read the state of the MIG devices across all GPUs.
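
As a quick sanity check (the namespace and label selector below are assumptions based on the chart defaults; adjust them to match your deployment), you can dump the env of the running GFD container and see whether the variable is there:

kubectl -n node-feature-discovery get pods \
  -l app.kubernetes.io/name=gpu-feature-discovery \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].env}{"\n"}{end}'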

eabochasjauregui commented 3 years ago

Hi @klueska,

Thank you for your reply. We deployed the GFD via Helm. We followed the documentation over at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html#using-mig-strategies-in-kubernetes, so I don't believe we set the NVIDIA_MIG_MONITOR_DEVICES environment variable. Would that be set as a value for the GFD chart?

klueska commented 3 years ago

Deploying via helm should set this for you if you set your migStrategy to anything other than none: https://github.com/NVIDIA/gpu-feature-discovery/blob/master/deployments/helm/gpu-feature-discovery/templates/daemonset.yml#L57
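
For reference, a helm install that enables the mixed strategy looks roughly like this (the nvgfd repo alias and URL are assumptions; use whatever repo name you added the chart under):

helm repo add nvgfd https://nvidia.github.io/gpu-feature-discovery
helm repo update
helm install --generate-name --set migStrategy=mixed nvgfd/gpu-feature-discovery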

Can you show me the output of:

helm get all gpu-feature-discovery-<release_id>

eabochasjauregui commented 3 years ago

Sure, here's the output:

NAME: gpu-feature-discovery-1608671054
LAST DEPLOYED: Tue Dec 22 15:04:14 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
migStrategy: mixed

COMPUTED VALUES:
affinity: {}
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: nvidia/gpu-feature-discovery
  tag: ""
imagePullSecrets: []
migStrategy: mixed
nameOverride: ""
namespace: node-feature-discovery
nfd:
  deploy: true
node-feature-discovery:
  fullnameOverride: nfd
  global: {}
  image:
    pullPolicy: IfNotPresent
    repository: quay.io/kubernetes_incubator/node-feature-discovery
    tag: ""
  imagePullSecrets: []
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: In
              values:
              - ""
          weight: 1
    annotations: {}
    extraLabelNs:
    - nvidia.com
    nodeSelector: {}
    podSecurityContext: {}
    resources: {}
    securityContext: {}
    service:
      port: 8080
      type: ClusterIP
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
  nameOverride: ""
  namespace:
    create: true
    name: node-feature-discovery
  rbac:
    create: true
    role: ""
  serviceAccount:
    annotations: {}
    create: true
    name: ""
  worker:
    affinity: {}
    annotations: {}
    nodeSelector: {}
    options:
      sources:
        pci:
          deviceLabelFields:
          - vendor
    podSecurityContext: {}
    resources: {}
    securityContext: {}
    tolerations: {}
nodeSelector:
  feature.node.kubernetes.io/pci-10de.present: "true"
podSecurityContext: {}
resources: {}
securityContext: {}
sleepInterval: 60s
tolerations: {}

HOOKS:
MANIFEST:
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: node-feature-discovery # NFD namespace
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfd-master
  namespace: node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nfd-master
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
  - ""
  resources:
  - nodes
# when using command line flag --resource-labels to create extended resources
# you will need to uncomment "- nodes/status"
# - nodes/status
  verbs:
  - get
  - patch
  - update
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nfd-master
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nfd-master
subjects:
- kind: ServiceAccount
  name: nfd-master
  namespace: node-feature-discovery
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nfd-master
  namespace: node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/role: master
  ports:
  - name: grpc
    targetPort: grpc
    protocol: TCP
    port: 8080
  type: ClusterIP
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/worker.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name:  nfd-worker
  namespace: node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/role: worker
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: gpu-feature-discovery-1608671054
      app.kubernetes.io/role: worker
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: gpu-feature-discovery-1608671054
        app.kubernetes.io/role: worker
      annotations:
        {}
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      securityContext:
        {}
      containers:
        - name: worker
          image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
          imagePullPolicy: IfNotPresent
          securityContext:
            {}
          resources:
            {}
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          command:
            - "nfd-worker"
          args:
            - "--sleep-interval=60s"
            - --server=nfd-master:8080
            - --options={"sources":{"pci":{"deviceLabelFields":["vendor"]}}}
## Enable TLS authentication (1/3)
## The example below assumes having the root certificate named ca.crt stored in
## a ConfigMap named nfd-ca-cert, and, the TLS authentication credentials stored
## in a TLS Secret named nfd-worker-cert
#            - "--ca-file=/etc/kubernetes/node-feature-discovery/trust/ca.crt"
#            - "--key-file=/etc/kubernetes/node-feature-discovery/certs/tls.key"
#            - "--cert-file=/etc/kubernetes/node-feature-discovery/certs/tls.crt"
          volumeMounts:
            - name: host-boot
              mountPath: "/host-boot"
              readOnly: true
            - name: host-os-release
              mountPath: "/host-etc/os-release"
              readOnly: true
            - name: host-sys
              mountPath: "/host-sys"
            - name: source-d
              mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
            - name: features-d
              mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
## Enable TLS authentication (2/3)
#            - name: nfd-ca-cert
#              mountPath: "/etc/kubernetes/node-feature-discovery/trust"
#              readOnly: true
#            - name: nfd-worker-cert
#              mountPath: "/etc/kubernetes/node-feature-discovery/certs"
#              readOnly: true
      volumes:
        - name: host-boot
          hostPath:
            path: "/boot"
        - name: host-os-release
          hostPath:
            path: "/etc/os-release"
        - name: host-sys
          hostPath:
            path: "/sys"
        - name: source-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/source.d/"
        - name: features-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d/"
## Enable TLS authentication (3/3)
#        - name: nfd-ca-cert
#          configMap:
#            name: nfd-ca-cert
#        - name: nfd-worker-cert
#          secret:
#            secretName: nfd-worker-cert
---
# Source: gpu-feature-discovery/templates/daemonset.yml
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-feature-discovery-1608671054
  namespace: node-feature-discovery
  labels:
    helm.sh/chart: gpu-feature-discovery-0.2.2
    app.kubernetes.io/name: gpu-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.2.2"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gpu-feature-discovery
      app.kubernetes.io/instance: gpu-feature-discovery-1608671054
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    spec:
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      securityContext:
        {}
      containers:
        - image: nvidia/gpu-feature-discovery:v0.2.2
          imagePullPolicy: IfNotPresent
          name: gpu-feature-discovery
          env:
            - name: GFD_SLEEP_INTERVAL
              value: 60s
            - name: GFD_MIG_STRATEGY
              value: mixed
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: output-dir
              mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
            - name: dmi-product-name
              mountPath: "/sys/class/dmi/id/product_name"
      volumes:
        - name: output-dir
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d"
        - name: dmi-product-name
          hostPath:
            path: "/sys/class/dmi/id/product_name"
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
---
# Source: gpu-feature-discovery/charts/node-feature-discovery/templates/master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfd-master
  namespace: node-feature-discovery
  labels:
    helm.sh/chart: node-feature-discovery-0.1.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: gpu-feature-discovery-1608671054
    app.kubernetes.io/version: "0.6.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/role: master
spec:
  replicas:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: gpu-feature-discovery-1608671054
      app.kubernetes.io/role: master
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: gpu-feature-discovery-1608671054
        app.kubernetes.io/role: master
      annotations:
        {}
    spec:
      serviceAccountName: nfd-master
      securityContext:
        {}
      containers:
        - name: master
          image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
          imagePullPolicy: IfNotPresent
          securityContext:
            {}
          resources:
            {}
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          ports:
          - name: grpc
            containerPort: 8080
          command:
            - "nfd-master"
          args:
            - --extra-label-ns=nvidia.com
## Enable TLS authentication
## The example below assumes having the root certificate named ca.crt stored in
## a ConfigMap named nfd-ca-cert, and, the TLS authentication credentials stored
## in a TLS Secret named nfd-master-cert.
## Additional hardening can be enabled by specifying --verify-node-name in
## args, in which case every nfd-worker requires a individual node-specific
## TLS certificate.
#            - "--ca-file=/etc/kubernetes/node-feature-discovery/trust/ca.crt"
#            - "--key-file=/etc/kubernetes/node-feature-discovery/certs/tls.key"
#            - "--cert-file=/etc/kubernetes/node-feature-discovery/certs/tls.crt"
#          volumeMounts:
#            - name: nfd-ca-cert
#              mountPath: "/etc/kubernetes/node-feature-discovery/trust"
#              readOnly: true
#            - name: nfd-master-cert
#              mountPath: "/etc/kubernetes/node-feature-discovery/certs"
#              readOnly: true
#      volumes:
#        - name: nfd-ca-cert
#          configMap:
#            name: nfd-ca-cert
#        - name: nfd-master-cert
#          secret:
#            secretName: nfd-master-cert
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
            weight: 1
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""

klueska commented 3 years ago

Hmm. Everything looks good there.

What version of the NVIDIA container stack do you have installed? E.g., if you are on an Ubuntu machine, can you show me the output of:

sudo dpkg -l '*nvidia-container*'

eabochasjauregui commented 3 years ago

The drivers, stack, and cluster were installed via the DeepOps playbooks. The nodes are RHEL, so we ran sudo rpm -qa '*nvidia-container*' instead; the output is as follows:

nvidia-container-runtime-3.2.0-1.x86_64
nvidia-container-selinux-20.09-0.el7.noarch
libnvidia-container1-1.2.0-1.x86_64
libnvidia-container-tools-1.2.0-1.x86_64
nvidia-container-toolkit-1.2.1-2.x86_64

klueska commented 3 years ago

Yeah, I think you probably need to update your libnvidia-container to v1.3.0.

Without going into too much detail: v1.3.0 is the first version that supports what's known as /dev based nvidia-capabilities (which the newest drivers turn on by default, and which MIG uses under the hood for almost everything). Versions v1.1.0 and v1.2.0 only worked with what are called /proc based nvidia-capabilities, which I'm assuming your driver doesn't have enabled.

More details here: https://docs.google.com/document/d/194A-Hg3mLlIW4eo2BSUcGKpzZf2ciX47eoH5T-WZNXo/edit?ts=5fd54b8c
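
For reference, a rough way to check which capability mechanism your driver exposes, and a RHEL-style upgrade (package names taken from your rpm output; the yum repo configuration is assumed to already be in place):

ls /proc/driver/nvidia/capabilities/mig/   # present with newer drivers; describes the MIG capabilities
ls /dev/nvidia-caps/                       # device nodes exist when /dev based nvidia-capabilities are in use
sudo yum update libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit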

eabochasjauregui commented 3 years ago

Alright, that was indeed the issue! We updated libnvidia-container1 and libnvidia-container-tools to the latest version, v1.3.1, and after restarting the kubelet, the device plugin, and GFD, everything worked as expected.
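
For anyone hitting the same thing, a quick way to confirm the fix took effect (the node name is a placeholder, and jq is optional):

kubectl describe node <node-name> | grep -i 'nvidia.com/mig'    # MIG labels and resources should now be listed
kubectl get node <node-name> -o json | jq '.status.allocatable' # MIG resources should appear as allocatable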

Thank you for your help!