NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

WSL2 support #64

Closed · davidshen84 closed this issue 11 months ago

davidshen84 commented 11 months ago

Hi,

My k8s master node lives in a WSL2 instance. I configured the Nvidia container runtime and I am able to access the GPU in my WSL2 environment.

I installed gfd using the nvidia-device-plugin helm chart with the following values:

config:
  map:
    default: |
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          renameByDefault: true
          failRequestsGreaterThanOne: false
          resources:
            - name: nvidia.com/gpu
              replicas: 4      
  default: "default"

runtimeClassName: nvidia

image:
  repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin
  tag: "8b416016"

gfd:
  enabled: true

This enables the gfd sub-chart. Regarding the 8b416016 tag, see https://github.com/NVIDIA/k8s-device-plugin/issues/332.
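For reference, applying these values looks roughly like this (the repo alias and release name nvdp are just what I use):

# Add the device plugin chart repo (the "nvdp" alias is arbitrary)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install/upgrade the chart with the values above saved as values.yaml
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  -f values.yaml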

After the chart is applied, I can see that an nvdp-node-feature-discovery-master pod and an nvdp-node-feature-discovery-worker pod are created and running, with no apparent errors in either of them. However, there is no trace of the nvdp-gpu-feature-discovery pod.

I can confirm that I can run pods with GPU resource requests.
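For completeness, the kind of pod I use for that check looks roughly like this (image and names are only examples; note that with renameByDefault: true the time-sliced resource may be exposed as nvidia.com/gpu.shared rather than nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia        # matches the runtimeClassName set in the chart values
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1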

I have another Gentoo Linux machine that joins my k8s cluster as an agent node. Upon joining, the expected pods are deployed to that node and the GPU feature labels are added to it.

I think the nfd master pod believes the WSL2 node does not have a GPU and has decided not to schedule the gfd pod on this node at all.
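A quick way to see why the gfd daemonset skips the node is to compare the labels NFD put on the node with the node affinity on the daemonset (the label selector below is an assumption based on the chart's naming conventions):

# Labels NFD discovered on the WSL2 node
kubectl get node <wsl2-node> --show-labels | tr ',' '\n' | grep -i -e nvidia -e pci

# Node affinity the gfd daemonset uses to select GPU nodes
kubectl get ds -A -l app.kubernetes.io/name=gpu-feature-discovery -o yaml | grep -B2 -A15 nodeAffinity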

davidshen84 commented 11 months ago

I also tried the nvgfd/gpu-feature-discovery chart directly and observed the same behaviour.

davidshen84 commented 11 months ago

I see. It is because NFD did not find any NVIDIA-related device on the node, since WSL2 does not expose the hardware information correctly.
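For context, NFD decides whether a node has an NVIDIA GPU from the PCI devices it can see under sysfs (vendor ID 10de), and inside a WSL2 instance there is, as far as I can tell, nothing there to enumerate; the GPU is exposed through /dev/dxg rather than as a PCI device:

# Inside the WSL2 instance: no PCI devices are exposed through sysfs
ls /sys/bus/pci/devices/
# (empty)

# The paravirtualized GPU shows up as a DirectX device instead
ls /dev/dxg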

Even if I add the nvidia.com/gpu.present label to the node to force GFD to be deployed there, I get this error in the pod:

Error getting machine type from /sys/class/dmi/id/product_name: could not open machine type file: open /sys/class/dmi/id/product_name: no such file or directory

That is because there is no DMI in WSL2; see https://github.com/microsoft/WSL/issues/4391.
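For anyone trying the same workaround, forcing the label looks like this:

# Manually mark the WSL2 node as a GPU node so the gfd daemonset schedules onto it
kubectl label node <wsl2-node> nvidia.com/gpu.present=true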

davidshen84 commented 11 months ago

nvm... that error message is actually just a warning. I can see the nvidia.com/* labels on my WSL2 node.
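For anyone else checking, the GFD-managed labels can be listed with something like:

# List the nvidia.com/* labels GFD added to the WSL2 node
kubectl describe node <wsl2-node> | grep nvidia.com/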