NVIDIA / k8s-device-plugin

Resources are not split when using “time slicing” with the NVIDIA device plugin for Kubernetes #990

y-shida-tg commented 1 month ago

Referring to "GitHub - NVIDIA/k8s-device-plugin: NVIDIA device plugin for Kubernetes", we have deployed the NVIDIA device plugin for Kubernetes and are trying out time slicing, but we are running into a problem. Specifically, the GPU capacity is displayed as follows: only "1" GPU is shown, although we expect "4" because of replicas: 4 in the YAML below. What could be the reason why "Capacity" is not increasing?

# kubectl describe node test-server
Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1

times.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
data:
  time-sliced: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
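
For reference, time slicing with replicas: 4 advertises four nvidia.com/gpu resources per physical GPU, so once this config actually takes effect on a node with one A100, the same query would be expected to show something along these lines (illustrative output, not captured from the cluster):

# kubectl describe node test-server
Capacity:
  nvidia.com/gpu: 4
Allocatable:
  nvidia.com/gpu: 4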

Hardware information:
Server: PowerEdge R750 (SKU=090E, ModelName=PowerEdge R750)
CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz

GPGPU information:
GPGPU: A100 80GB
CUDA Version: 12.2
Driver Version: 535.54.03
nvidia-container-runtime: runc version 1.0.2, spec: 1.0.2-dev, go: go1.16.7, libseccomp: 2.5.1

Linux information:
OS: CentOS Linux release 8.5.2111
k8s environment:
kubectl Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
kubectl Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.17", GitCommit:"953be8927218ec8067e1af2641e540238ffd7576", GitTreeState:"clean", BuildDate:"2023-02-22T13:27:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
crio version: 1.23.5

NVIDIA device plugin for Kubernetes version used: v0.16.1

klueska commented 1 month ago

The only reason this would happen is if your plugin on the node isn't actually pointing to this config. Did you launch the plugin pointing to this config map and then update the label on the node to point to the particular time-slicing config within that config map?

https://github.com/NVIDIA/k8s-device-plugin/tree/main?tab=readme-ov-file#multiple-config-file-example
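
Concretely, the flow from that section looks roughly like the following, a sketch using the ConfigMap name and config key from the YAML above (the exact chart values are documented in the linked README):

# Deploy the plugin via Helm, pointing it at the pre-created ConfigMap
# (device-plugin-config, which holds the time-sliced config):
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.name=device-plugin-config

# Then select the time-slicing config for this node by labeling it:
kubectl label nodes test-server --overwrite \
    nvidia.com/device-plugin.config=time-sliced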

y-shida-tg commented 1 month ago

Thank you for your reply.

I created dp-example-config0.yaml and dp-example-config1.yaml, applied the config with helm upgrade -i nvdp nvdp/nvidia-device-plugin, and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml. However, the node information still shows only 1 GPU as follows:

Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1

The following config label is added to the node:

Labels: nvidia.com/device-plugin.config=config0

Are there any other items to check in the device plugin, or settings that need to be configured on the node side? Currently, the only node-side modification is for device-plugin compatibility with cri-o.
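
For reference, a generic way to confirm which config the Helm-deployed plugin actually loaded is to check its pods and their logs; a sketch, assuming the nvidia-device-plugin namespace and the chart labels visible in the ConfigMap output below:

# List the plugin pods created by the Helm release:
kubectl get pods -n nvidia-device-plugin -o wide

# Inspect their logs for the config they picked up:
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin --all-containers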

Detailed file contents and commands are shown below.

Configuration file contents

# cat dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4

# cat dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

Apply configuration file

# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.1 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml

# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0

Check the contents of the ConfigMap

# kubectl describe configmaps -n nvidia-device-plugin
Name:         kube-root-ca.crt
Namespace:    nvidia-device-plugin
Labels:       <none>
Annotations:  kubernetes.io/description:
                Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----

BinaryData
====

Events:  <none>

Name:         nvdp-nvidia-device-plugin-configs
Namespace:    nvidia-device-plugin
Labels:       app.kubernetes.io/instance=nvdp
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nvidia-device-plugin
              app.kubernetes.io/version=0.16.1
              helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations:  meta.helm.sh/release-name: nvdp
              meta.helm.sh/release-namespace: nvidia-device-plugin

Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4

BinaryData
====

Events:  <none>

Node state (without nvidia-device-plugin-daemonset)

# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

Node status (after starting nvidia-device-plugin-daemonset)

# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:12:09 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl                       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h                    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
  kube-system                 nvidia-device-plugin-daemonset-drdv2    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

klueska commented 1 month ago

I'm confused by this step that you reference:

> and then started the device plugin with kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

The helm install/upgrade command already starts the device plugin, configured to be aware of the configs you point it at. The static deployment from the URL you reference is not aware of these configs and would require a substantial amount of additional code to make it aware of them (which is why helm is the preferred installation method for the plugin).
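
In practice that means the Helm release deploys its own DaemonSet, and the static one should not be applied alongside it; a sketch of how to verify and clean up, assuming the namespace from the commands above:

# The Helm-managed DaemonSet lives in the release namespace:
kubectl get daemonsets -n nvidia-device-plugin

# Remove the static deployment so the two plugins do not conflict:
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml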

y-shida-tg commented 3 weeks ago

> The helm install/upgrade command already starts the device plugin, configured to be aware of the configs you point it at. The static deployment from the URL you reference is not aware of these configs and would require a substantial amount of additional code to make it aware of them (which is why helm is the preferred installation method for the plugin).

Following the documentation, I tried to proceed using only the helm operations, but the node information is as follows and the nvidia-device-plugin was not running:

Capacity:
  nvidia.com/gpu: 0
Allocatable:
  nvidia.com/gpu: 0

I believe the issue is that the nvidia-device-plugin does not start through the helm operations alone. Are there any items to check?
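
For reference, some generic checks on why a Helm-deployed DaemonSet schedules no pods; the DaemonSet name below is an assumption based on the chart's release naming (matching the nvdp-nvidia-device-plugin-configs ConfigMap shown later):

# Confirm the DaemonSet exists and how many pods it desires/schedules:
kubectl get daemonsets -n nvidia-device-plugin

# Look for scheduling problems (node selectors, tolerations, image pulls):
kubectl describe daemonset -n nvidia-device-plugin nvdp-nvidia-device-plugin
kubectl get events -n nvidia-device-plugin --sort-by=.lastTimestamp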

Below are the command and configuration details.


## Contents of the config file
# cat dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4

## For reference: the static manifest (nvidia-device-plugin.yml) published in the repository
-----------------------------------------------------------------------
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

# cat dp-example-config1.yaml
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

## Apply config file
# helm search repo nvdp --devel
NAME                            CHART VERSION   APP VERSION     DESCRIPTION
nvdp/gpu-feature-discovery      0.16.2          0.16.2          A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin       0.16.2          0.16.2          A Helm chart for the nvidia-device-plugin on Ku...
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set config.default=config0 \
    --set-file config.map.config0=dp-example-config0.yaml \
    --set-file config.map.config1=dp-example-config1.yaml

# kubectl label nodes onp1-4-r750 --overwrite \
    nvidia.com/device-plugin.config=config0
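
A quick way to confirm what the release actually deployed (generic commands; their output was not captured here):

# helm list -n nvidia-device-plugin
# kubectl get daemonsets,pods -n nvidia-device-plugin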

## Checking the contents of the config map
# kubectl describe configmaps -n nvidia-device-plugin
Name:         kube-root-ca.crt
Namespace:    nvidia-device-plugin
Labels:       <none>
Annotations:  kubernetes.io/description:
                Contains a CA bundle that can be used to verify the kube-apiserver when using internal endpoints such as the internal service IP or kubern...

Data
====
ca.crt:
----
-----BEGIN CERTIFICATE-----
MIIC/jCCAeagAwIBAgIBADANBgkqhkiG9w0BAQsFADAVMRMwEQYDVQQDEwprdWJl
cm5ldGVzMB4XDTI0MDYxODA4NDQwNFoXDTM0MDYxNjA4NDQwNFowFTETMBEGA1UE
AxMKa3ViZXJuZXRlczCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAN6a
txp2J29lwgQ7eEiQ+h2DOXYecFcnodeyXt0jTXy2YacPh7kvt3alZ7bm+NIuDhkt
2dAnx7qJQRSnnM5xEP6bliHjkqVRMDyQf5BqgfLyKf2+usuYyas3dAevtKqI0qFP
5MnoHhUI2z+T5xleCguWxdsl39kQErD8WjWmQ2tR2a1JQOvUE/8QBo4tP0peyBFE
BwurzgDwFuaVRjrzREBL1BCzdQbG3XtGCiEyMvcgm2yO1kNcjYibqK5kc5R/zQ31
p/yJRPs4tcQEcRlh62S9HgghhYpQQb1whVaK7mZP3BJ3a+ku7Dp1E8+rnNkVtRgO
icItv/Esv57OBX9MNwkCAwEAAaNZMFcwDgYDVR0PAQH/BAQDAgKkMA8GA1UdEwEB
/wQFMAMBAf8wHQYDVR0OBBYEFLol1Lsh1L1n76Nz1uay7TkdCYgnMBUGA1UdEQQO
MAyCCmt1YmVybmV0ZXMwDQYJKoZIhvcNAQELBQADggEBAJz3AlS8e8CoyFxoBp3j
b/sbgeL6DXNfOPafOPUvMJrOfTw4ZhXuHmB2kY/dws9hPxSuiVO1Z3woymeYGHrl
aIFy1f5d4XtTrsjKWkV9aqcw+UZ4Z4H2R73F8A5VrVAq9zUSre3J45H7QVdAYIdP
PUI+uvtg0o+IBKIYZo43uBjMsZm1h2zQe03+Bf8DOQd8WByb/VEWM4/blYLwiMs7
4pvImNdTJChSrL3tbelM/X2M78RYXYXNZqkGw0iIRS07Tv9B688Xx8dUhs5WxjZU
9Ge7VFxK+W8lMjo0V3EFHhbYnS0LwMhuMpAryBpd3tcnktOVBh2lPZO2g6WseOVB
RNI=
-----END CERTIFICATE-----

BinaryData
====

Events:  <none>

Name:         nvdp-nvidia-device-plugin-configs
Namespace:    nvidia-device-plugin
Labels:       app.kubernetes.io/instance=nvdp
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=nvidia-device-plugin
              app.kubernetes.io/version=0.16.1
              helm.sh/chart=nvidia-device-plugin-0.16.1
Annotations:  meta.helm.sh/release-name: nvdp
              meta.helm.sh/release-namespace: nvidia-device-plugin

Data
====
config0:
----
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
config1:
----
version: v1
flags:
  migStrategy: "mixed" # Only change from config0.yaml
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid

BinaryData
====

Events:  <none>

## Node status (without nvidia-device-plugin-daemonset)
# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:11:08 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:09:32 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (3 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

## Node status (after nvidia-device-plugin-daemonset launch)
# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml

# kubectl describe nodes onp1-4-r750
Name:               onp1-4-r750
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=onp1-4-r750
                    kubernetes.io/os=linux
                    nvidia.com/device-plugin.config=config0
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.84.64
                    projectcalico.org/IPv6Address: fc00:a000::14/64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 18 Jun 2024 18:00:26 +0900
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  onp1-4-r750
  AcquireTime:     <unset>
  RenewTime:       Tue, 15 Oct 2024 18:12:09 +0900
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 09 Oct 2024 09:41:27 +0900   Wed, 09 Oct 2024 09:41:27 +0900   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 15 Oct 2024 18:12:06 +0900   Fri, 30 Aug 2024 07:40:34 +0900   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  fc00:a000::14
  Hostname:    onp1-4-r750
Capacity:
  cpu:                112
  ephemeral-storage:  2737838616Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395422092Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                112
  ephemeral-storage:  2523192064328
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395319692Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 d4e91833fac54bb0b9458e38819fdf2b
  System UUID:                4c4c4544-0046-5110-8051-c3c04f395633
  Boot ID:                    6de46ba4-46ee-4413-8fde-74cf7ff5473d
  Kernel Version:             5.10.57
  OS Image:                   CentOS Linux 8
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.23.5
  Kubelet Version:            v1.23.6
  Kube-Proxy Version:         v1.23.6
PodCIDR:                      1100:0:0:1::/64
PodCIDRs:                     1100:0:0:1::/64,10.244.1.0/24
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 calico-node-m8xcl                       250m (0%)     0 (0%)      0 (0%)           0 (0%)         118d
  kube-system                 kube-multus-ds-cps4h                    100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      118d
  kube-system                 kube-proxy-zhwt4                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         119d
  kube-system                 nvidia-device-plugin-daemonset-drdv2    0 (0%)        0 (0%)      0 (0%)           0 (0%)         18s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests   Limits
  --------           --------   ------
  cpu                350m (0%)  100m (0%)
  memory             50Mi (0%)  50Mi (0%)
  ephemeral-storage  0 (0%)     0 (0%)
  hugepages-1Gi      0 (0%)     0 (0%)
  hugepages-2Mi      0 (0%)     0 (0%)
  nvidia.com/gpu     0          0
Events:              <none>

y-shida-tg commented 5 days ago

We have continued trying since then but have not been able to resolve this issue. If there is anything else we should try, I would appreciate a reply.