NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Cannot get gpu-feature-discovery working with v0.12.0 helm chart #308

Closed damonmaria closed 2 years ago

damonmaria commented 2 years ago

1. Issue or feature description

Cannot get gpu-feature-discovery working with v0.12.0 helm chart.

2. Steps to reproduce the issue

Enable gfd and install the helm chart:

# helm upgrade --install nvidia-device-plugin --namespace=nvidia --repo=https://nvidia.github.io/k8s-device-plugin nvidia-device-plugin --version=0.12.0 --debug --reset-values--set gfd.enabled=true
...
USER-SUPPLIED VALUES:
gfd:
  enabled: true
...
COMPUTED VALUES:
...
gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
gpu-feature-discovery:
  fullnameOverride: ""
  global: {}
  image:
    pullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/gpu-feature-discovery
    tag: ""
  imagePullSecrets: []
  nameOverride: ""
  noTimestamp: null
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  podAnnotations: {}
  podSecurityContext: {}
  resources: {}
  securityContext:
    privileged: true
  selectorLabelsOverride: {}
  sleepInterval: null
...
---
# Source: nvidia-device-plugin/templates/gfd.yml
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

As shown above, the gfd.yml template renders only the copyright notice and not the daemonset, even though gfd.enabled is set. I have tried setting it through a values.yaml with the same result.

I also presume, since there is only a single blank line after the copyright, that {{- if .Values.gfd.enabled }} is evaluating as false; otherwise there would be more blank lines.
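A quick way to spot this symptom in a large rendered manifest is to list every template whose output contains nothing but comments. The following is a minimal sketch, not part of the chart or of helm itself: the function name `comment_only_sources` is hypothetical, and it assumes the manifest uses the `---` separators and `# Source:` headers that helm prints in the output above.

```shell
#!/bin/sh
# Reads a rendered helm manifest on stdin and prints the "# Source:" path of
# every template whose rendered output holds only comments or blank lines.
# comment_only_sources is a hypothetical helper name, not a helm command.
comment_only_sources() {
  awk '
    # A document separator closes the current document only if it had content;
    # helm sometimes prints "# Source:" and then "---" before the real body.
    /^---[ \t]*$/ {
      if (has_content) { src = ""; has_content = 0 }
      next
    }
    # A new Source header: report the previous one if it never got content.
    /^# Source: / {
      if (src != "" && !has_content) print src
      src = substr($0, 11)
      has_content = 0
      next
    }
    # Comments and blank lines do not count as content.
    /^[ \t]*#/ || /^[ \t]*$/ { next }
    { has_content = 1 }
    END { if (src != "" && !has_content) print src }
  '
}
```

Assuming `helm template` accepts the same chart arguments as the install command above, piping its output through the function (for example `helm template nvidia-device-plugin --repo=https://nvidia.github.io/k8s-device-plugin nvidia-device-plugin --version=0.12.0 --set gfd.enabled=true | comment_only_sources`) should list `nvidia-device-plugin/templates/gfd.yml` when the issue is present.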

3. Information to attach (optional if deemed irrelevant)

Helm: 3.9.0
nvidia-device-plugin: 0.12.0

I presume other details are not relevant as can be seen above the

klueska commented 2 years ago

Hmm. We have had many users successfully launch gfd in this way and I am also not able to reproduce this myself. Is there any other info you could give me about your setup that may be causing this discrepancy?

klueska commented 2 years ago

One thing I noticed (though the COMPUTED VALUES we see later seem to suggest it doesn’t matter) is that you have no space between --reset-values and --set.

damonmaria commented 2 years ago

Thanks for the response @klueska. I had removed an unnecessary parameter from what I pasted and that's how the formatting for that command line parameter got messed up.

To be clear, here is the full output of a minimal test that shows the issue for me. The issue is that the final template (nvidia-device-plugin/templates/gfd.yml) does not have any content beyond the comment header. Does the same command produce a different result for you?

# helm install nvidia-device-plugin --namespace=nvidia --repo=https://nvidia.github.io/k8s-device-plugin nvidia-device-plugin --version=0.12.0  --debug  --set gfd.enabled=true --dry-run
install.go:178: [debug] Original chart version: "0.12.0"
install.go:195: [debug] CHART PATH: /root/.cache/helm/repository/nvidia-device-plugin-0.12.0.tgz

NAME: nvidia-device-plugin
LAST DEPLOYED: Sun Jun 12 16:47:42 2022
NAMESPACE: nvidia
STATUS: pending-install
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
gfd:
  enabled: true

COMPUTED VALUES:
affinity: {}
allowDefaultNamespace: false
compatWithCPUManager: null
config:
  default: ""
  map: {}
  name: ""
deviceIDStrategy: null
deviceListStrategy: null
failOnInitError: null
fullnameOverride: ""
gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
gpu-feature-discovery:
  fullnameOverride: ""
  global: {}
  image:
    pullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/gpu-feature-discovery
    tag: ""
  imagePullSecrets: []
  nameOverride: ""
  noTimestamp: null
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"
  podAnnotations: {}
  podSecurityContext: {}
  resources: {}
  securityContext:
    privileged: true
  selectorLabelsOverride: {}
  sleepInterval: null
image:
  pullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/k8s-device-plugin
  tag: ""
imagePullSecrets: []
legacyDaemonsetAPI: null
migStrategy: null
nameOverride: ""
nfd:
  fullnameOverride: ""
  global: {}
  image:
    pullPolicy: IfNotPresent
    repository: k8s.gcr.io/nfd/node-feature-discovery
  imagePullSecrets: []
  master:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: In
              values:
              - ""
          weight: 1
        - preference:
            matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: In
              values:
              - ""
          weight: 1
    annotations: {}
    deploymentAnnotations: {}
    extraLabelNs:
    - nvidia.com
    nodeSelector: {}
    podSecurityContext: {}
    rbac:
      create: true
    replicaCount: 1
    resourceLabels: []
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
    service:
      port: 8080
      type: ClusterIP
    serviceAccount:
      annotations: {}
      create: true
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Equal
      value: ""
  nameOverride: node-feature-discovery
  nodeFeatureRule:
    createCRD: true
  serviceAccount:
    name: node-feature-discovery
  tls:
    certManager: false
    enable: false
  topologyUpdater:
    affinity: {}
    annotations: {}
    createCRDs: false
    enable: false
    nodeSelector: {}
    podSecurityContext: {}
    rbac:
      create: false
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsUser: 0
    serviceAccount:
      annotations: {}
      create: false
    tolerations: []
    updateInterval: 60s
    watchNamespace: '*'
  worker:
    affinity: {}
    annotations: {}
    config:
      sources:
        pci:
          deviceClassWhitelist:
          - "02"
          - "0200"
          - "0207"
          - "0300"
          - "0302"
          deviceLabelFields:
          - vendor
    daemonsetAnnotations: {}
    mountUsrSrc: false
    nodeSelector: {}
    podSecurityContext: {}
    resources: {}
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
    serviceAccount:
      annotations: {}
      create: true
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Equal
      value: ""
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Equal
      value: present
nodeSelector: {}
nvidiaDriverRoot: null
podAnnotations: {}
podSecurityContext: {}
resources: {}
runtimeClassName: null
securityContext: {}
selectorLabelsOverride: {}
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
updateStrategy:
  type: RollingUpdate

HOOKS:
MANIFEST:
---
# Source: nvidia-device-plugin/charts/nfd/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nvidia-device-plugin-node-feature-discovery
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: nvidia-device-plugin/charts/nfd/templates/serviceaccount.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nvidia-device-plugin-node-feature-discovery-worker
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
---
# Source: nvidia-device-plugin/charts/nfd/templates/nfd-worker-conf.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-node-feature-discovery-worker-conf
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
data:
  nfd-worker.conf: |-
    sources:
      pci:
        deviceClassWhitelist:
        - "02"
        - "0200"
        - "0207"
        - "0300"
        - "0302"
        deviceLabelFields:
        - vendor
---
# Source: nvidia-device-plugin/charts/nfd/templates/nodefeaturerule-crd.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.7.0
  creationTimestamp: null
  name: nodefeaturerules.nfd.k8s-sigs.io
spec:
  group: nfd.k8s-sigs.io
  names:
    kind: NodeFeatureRule
    listKind: NodeFeatureRuleList
    plural: nodefeaturerules
    singular: nodefeaturerule
  scope: Cluster
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: NodeFeatureRule resource specifies a configuration for feature-based
          customization of node objects, such as node labeling.
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            description: NodeFeatureRuleSpec describes a NodeFeatureRule.
            properties:
              rules:
                description: Rules is a list of node customization rules.
                items:
                  description: Rule defines a rule for node customization such as
                    labeling.
                  properties:
                    labels:
                      additionalProperties:
                        type: string
                      description: Labels to create if the rule matches.
                      type: object
                    labelsTemplate:
                      description: LabelsTemplate specifies a template to expand for
                        dynamically generating multiple labels. Data (after template
                        expansion) must be keys with an optional value (<key>[=<value>])
                        separated by newlines.
                      type: string
                    matchAny:
                      description: MatchAny specifies a list of matchers one of which
                        must match.
                      items:
                        description: MatchAnyElem specifies one sub-matcher of MatchAny.
                        properties:
                          matchFeatures:
                            description: MatchFeatures specifies a set of matcher
                              terms all of which must match.
                            items:
                              description: FeatureMatcherTerm defines requirements
                                against one feature set. All requirements (specified
                                as MatchExpressions) are evaluated against each element
                                in the feature set.
                              properties:
                                feature:
                                  type: string
                                matchExpressions:
                                  additionalProperties:
                                    description: "MatchExpression specifies an expression
                                      to evaluate against a set of input values. It
                                      contains an operator that is applied when matching
                                      the input and an array of values that the operator
                                      evaluates the input against. \n NB: CreateMatchExpression
                                      or MustCreateMatchExpression() should be used
                                      for     creating new instances. NB: Validate()
                                      must be called if Op or Value fields are modified
                                      or if a new     instance is created from scratch
                                      without using the helper functions."
                                    properties:
                                      op:
                                        description: Op is the operator to be applied.
                                        enum:
                                        - In
                                        - NotIn
                                        - InRegexp
                                        - Exists
                                        - DoesNotExist
                                        - Gt
                                        - Lt
                                        - GtLt
                                        - IsTrue
                                        - IsFalse
                                        type: string
                                      value:
                                        description: Value is the list of values that
                                          the operand evaluates the input against.
                                          Value should be empty if the operator is
                                          Exists, DoesNotExist, IsTrue or IsFalse.
                                          Value should contain exactly one element
                                          if the operator is Gt or Lt and exactly
                                          two elements if the operator is GtLt. In
                                          other cases Value should contain at least
                                          one element.
                                        items:
                                          type: string
                                        type: array
                                    required:
                                    - op
                                    type: object
                                  description: MatchExpressionSet contains a set of
                                    MatchExpressions, each of which is evaluated against
                                    a set of input values.
                                  type: object
                              required:
                              - feature
                              - matchExpressions
                              type: object
                            type: array
                        required:
                        - matchFeatures
                        type: object
                      type: array
                    matchFeatures:
                      description: MatchFeatures specifies a set of matcher terms
                        all of which must match.
                      items:
                        description: FeatureMatcherTerm defines requirements against
                          one feature set. All requirements (specified as MatchExpressions)
                          are evaluated against each element in the feature set.
                        properties:
                          feature:
                            type: string
                          matchExpressions:
                            additionalProperties:
                              description: "MatchExpression specifies an expression
                                to evaluate against a set of input values. It contains
                                an operator that is applied when matching the input
                                and an array of values that the operator evaluates
                                the input against. \n NB: CreateMatchExpression or
                                MustCreateMatchExpression() should be used for     creating
                                new instances. NB: Validate() must be called if Op
                                or Value fields are modified or if a new     instance
                                is created from scratch without using the helper functions."
                              properties:
                                op:
                                  description: Op is the operator to be applied.
                                  enum:
                                  - In
                                  - NotIn
                                  - InRegexp
                                  - Exists
                                  - DoesNotExist
                                  - Gt
                                  - Lt
                                  - GtLt
                                  - IsTrue
                                  - IsFalse
                                  type: string
                                value:
                                  description: Value is the list of values that the
                                    operand evaluates the input against. Value should
                                    be empty if the operator is Exists, DoesNotExist,
                                    IsTrue or IsFalse. Value should contain exactly
                                    one element if the operator is Gt or Lt and exactly
                                    two elements if the operator is GtLt. In other
                                    cases Value should contain at least one element.
                                  items:
                                    type: string
                                  type: array
                              required:
                              - op
                              type: object
                            description: MatchExpressionSet contains a set of MatchExpressions,
                              each of which is evaluated against a set of input values.
                            type: object
                        required:
                        - feature
                        - matchExpressions
                        type: object
                      type: array
                    name:
                      description: Name of the rule.
                      type: string
                    vars:
                      additionalProperties:
                        type: string
                      description: Vars is the variables to store if the rule matches.
                        Variables do not directly inflict any changes in the node
                        object. However, they can be referenced from other rules enabling
                        more complex rule hierarchies, without exposing intermediary
                        output values as labels.
                      type: object
                    varsTemplate:
                      description: VarsTemplate specifies a template to expand for
                        dynamically generating multiple variables. Data (after template
                        expansion) must be keys with an optional value (<key>[=<value>])
                        separated by newlines.
                      type: string
                  required:
                  - name
                  type: object
                type: array
            required:
            - rules
            type: object
        required:
        - spec
        type: object
    served: true
    storage: true
status:
  acceptedNames:
    kind: ""
    plural: ""
  conditions: []
  storedVersions: []
---
# Source: nvidia-device-plugin/charts/nfd/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nvidia-device-plugin-node-feature-discovery
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
  - update
  - list
- apiGroups:
  - nfd.k8s-sigs.io
  resources:
  - nodefeaturerules
  verbs:
  - get
  - list
  - watch
---
# Source: nvidia-device-plugin/charts/nfd/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nvidia-device-plugin-node-feature-discovery
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nvidia-device-plugin-node-feature-discovery
subjects:
- kind: ServiceAccount
  name: nvidia-device-plugin-node-feature-discovery
  namespace: nvidia
---
# Source: nvidia-device-plugin/charts/nfd/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nvidia-device-plugin-node-feature-discovery-master
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
    role: master
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: grpc
      protocol: TCP
      name: grpc
  selector:
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
---
# Source: nvidia-device-plugin/charts/nfd/templates/worker.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name:  nvidia-device-plugin-node-feature-discovery-worker
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
    role: worker
  annotations:
    {}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: nvidia-device-plugin
      role: worker
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: nvidia-device-plugin
        role: worker
      annotations:
        {}
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      serviceAccountName: nvidia-device-plugin-node-feature-discovery-worker
      securityContext:
        {}
      containers:
      - name: worker
        securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
        image: "k8s.gcr.io/nfd/node-feature-discovery:v0.11.0"
        imagePullPolicy: IfNotPresent
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        resources:
            {}
        command:
        - "nfd-worker"
        args:
        - "--server=nvidia-device-plugin-node-feature-discovery-master:8080"
        volumeMounts:
        - name: host-boot
          mountPath: "/host-boot"
          readOnly: true
        - name: host-os-release
          mountPath: "/host-etc/os-release"
          readOnly: true
        - name: host-sys
          mountPath: "/host-sys"
          readOnly: true
        - name: host-usr-lib
          mountPath: "/host-usr/lib"
          readOnly: true
        - name: source-d
          mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
          readOnly: true
        - name: features-d
          mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
          readOnly: true
        - name: nfd-worker-conf
          mountPath: "/etc/kubernetes/node-feature-discovery"
          readOnly: true
      volumes:
        - name: host-boot
          hostPath:
            path: "/boot"
        - name: host-os-release
          hostPath:
            path: "/etc/os-release"
        - name: host-sys
          hostPath:
            path: "/sys"
        - name: host-usr-lib
          hostPath:
            path: "/usr/lib"
        - name: source-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/source.d/"
        - name: features-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d/"
        - name: nfd-worker-conf
          configMap:
            name: nvidia-device-plugin-node-feature-discovery-worker-conf
            items:
              - key: nfd-worker.conf
                path: nfd-worker.conf
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Equal
          value: present
---
# Source: nvidia-device-plugin/templates/daemonset.yml
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  labels:
    helm.sh/chart: nvidia-device-plugin-0.12.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "0.12.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: nvidia-device-plugin
      annotations:
        rollme: "7FqcP"
    spec:
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      securityContext:
        {}
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        env:
          - name: NVIDIA_MIG_MONITOR_DEVICES
            value: all
        securityContext:
          capabilities:
                add:
                  - SYS_ADMIN
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
# Source: nvidia-device-plugin/charts/nfd/templates/master.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name:  nvidia-device-plugin-node-feature-discovery-master
  labels:
    helm.sh/chart: nfd-0.11.0
    app.kubernetes.io/name: node-feature-discovery
    app.kubernetes.io/instance: nvidia-device-plugin
    app.kubernetes.io/version: "v0.11.0"
    app.kubernetes.io/managed-by: Helm
    role: master
  annotations:
    {}
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: node-feature-discovery
      app.kubernetes.io/instance: nvidia-device-plugin
      role: master
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-feature-discovery
        app.kubernetes.io/instance: nvidia-device-plugin
        role: master
      annotations:
        {}
    spec:
      serviceAccountName: nvidia-device-plugin-node-feature-discovery
      securityContext:
        {}
      containers:
        - name: master
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
          image: "k8s.gcr.io/nfd/node-feature-discovery:v0.11.0"
          imagePullPolicy: IfNotPresent
          livenessProbe:
            exec:
              command:
              - "/usr/bin/grpc_health_probe"
              - "-addr=:8080"
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            exec:
              command:
              - "/usr/bin/grpc_health_probe"
              - "-addr=:8080"
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 10
          ports:
          - containerPort: 8080
            name: grpc
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          command:
            - "nfd-master"
          resources:
            {}
          args:
            - "--extra-label-ns=nvidia.com"
            ## By default, disable NodeFeatureRules controller for other than the default instances
            - "-featurerules-controller=true"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
            weight: 1
          - preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: In
                values:
                - ""
            weight: 1
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Equal
          value: ""
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Equal
          value: ""
---
# Source: nvidia-device-plugin/templates/gfd.yml
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

damonmaria commented 2 years ago

Also, I currently have v0.11.0 of the device plugin, and gpu-feature-discovery (separately) installed and working using helm.

klueska commented 2 years ago

Thanks for reporting this. I have found the issue and will push a patch release tomorrow. I'm not sure how this slipped through our testing.

This is the fix (a version bump was missed for the internal GFD subchart):

$ git diff
diff --git a/deployments/helm/nvidia-device-plugin/charts/gpu-feature-discovery/Chart.yaml b/deployments/helm/nvidia-device-plugin/charts/gpu-feature-discovery/Chart.yaml
index 85a3a838..441fe64a 100644
--- a/deployments/helm/nvidia-device-plugin/charts/gpu-feature-discovery/Chart.yaml
+++ b/deployments/helm/nvidia-device-plugin/charts/gpu-feature-discovery/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
 name: gpu-feature-discovery
 type: application
 description: A Helm chart for gpu-feature-discovery on Kubernetes
-version: "0.6.0-rc.1"
-appVersion: "0.6.0-rc.1"
+version: "0.6.0"
+appVersion: "0.6.0"
 kubeVersion: ">= 1.10.0-0"
 home: https://github.com/NVIDIA/gpu-feature-discovery

In the meantime, you can use --version=0.12.0-rc.6 since the bits (other than version bumping) are identical.
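Since the root cause was a subchart still carrying a release-candidate version, a pre-release check along the following lines might catch the same slip in future. This is only a sketch under stated assumptions: `prerelease_versions` is a hypothetical helper, not part of this repository's tooling, and it assumes `version` and `appVersion` are top-level keys holding a plain `MAJOR.MINOR.PATCH` value with an optional hyphenated suffix, as in the diff above.

```shell
#!/bin/sh
# Prints every top-level version/appVersion line that still carries a
# pre-release suffix such as "-rc.1" in the given Chart.yaml files.
# Returns 1 if any such line is found, 0 otherwise.
# prerelease_versions is a hypothetical helper name.
prerelease_versions() {
  found=0
  for chart in "$@"; do
    # Anchoring at ^ keeps keys such as kubeVersion from matching.
    # /dev/null is passed so grep always prefixes matches with the file name.
    if grep -En '^(version|appVersion):[[:space:]]*"?[0-9]+\.[0-9]+\.[0-9]+-' "$chart" /dev/null; then
      found=1
    fi
  done
  return "$found"
}
```

Run against the parent chart and every bundled subchart, for example `prerelease_versions deployments/helm/nvidia-device-plugin/charts/gpu-feature-discovery/Chart.yaml`, it would have printed the two `0.6.0-rc.1` lines before the fix and nothing after it.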

klueska commented 2 years ago

Plugin v0.12.1 has been released and should address this.

damonmaria commented 2 years ago

Can confirm issue solved with v0.12.1.

Thanks for the quick response.