NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

No MIG devices exist after successfully applying the MIG configuration with migmanager #603

Open qasmi opened 8 months ago

qasmi commented 8 months ago

1. Quick Debug Information

2. Issue or feature description

I'm trying to use gpu-operator to create and use MIG devices. The nvidia-mig-manager applied my config (log => "Successfuly updated to MIG config: all-1g.5gb"), and the nodes are labelled:

kubectl get node -o json | jq '.items[].metadata.labels'

...
"nvidia.com/gpu.count": "0",
  "nvidia.com/gpu.deploy.container-toolkit": "true",
  "nvidia.com/gpu.deploy.dcgm": "true",
  "nvidia.com/gpu.deploy.dcgm-exporter": "true",
  "nvidia.com/gpu.deploy.device-plugin": "true",
  "nvidia.com/gpu.deploy.driver": "true",
  "nvidia.com/gpu.deploy.gpu-feature-discovery": "true",
  "nvidia.com/gpu.deploy.mig-manager": "true",
  "nvidia.com/gpu.deploy.node-status-exporter": "true",
  "nvidia.com/gpu.deploy.nvsm": "true",
  "nvidia.com/gpu.deploy.operator-validator": "true",
  "nvidia.com/gpu.family": "ampere",
  "nvidia.com/gpu.machine": "kind",
  "nvidia.com/gpu.memory": "0",
  "nvidia.com/gpu.present": "true",
  "nvidia.com/gpu.product": "NVIDIA-A100-80GB-PCIe-MIG-INVALID",
  "nvidia.com/gpu.replicas": "0",
  "nvidia.com/mig.capable": "true",
  "nvidia.com/mig.config": "all-1g.5gb",
  "nvidia.com/mig.config.state": "success",
  "nvidia.com/mig.strategy": "single"
...

but the nvidia-device-plugin-daemonset is crashing with the error: error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration

Also, the nvidia-smi output from the driver pod shows:

[Screenshot (2023-10-26): nvidia-smi output from the driver pod showing MIG mode Enabled but no MIG devices listed]

The screenshot indicates that MIG is enabled, but we are unable to list the devices created by mig-manager. Any assistance would be greatly appreciated. 🙏
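
A quick way to double-check from the driver pod whether any MIG devices actually exist (a minimal sketch; the pod name and namespace are placeholders):

kubectl exec -n <gpu-operator-namespace> <driver-pod> -- nvidia-smi -L        # lists GPUs and any MIG devices beneath them
kubectl exec -n <gpu-operator-namespace> <driver-pod> -- nvidia-smi mig -lgi  # lists GPU instances created by mig-manager, if any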

3. Steps to reproduce the issue

elezar commented 8 months ago

Note that you have an A100 80GB device. I think the issue is that the all-3g.20gb configuration is invalid.

The 3g.20gb configuration is for an A100 40GB device.

If you want to limit memory, the corresponding profile on the A100 80GB device is all-1g.20gb; if you're interested in compute, it would be all-3g.40gb.
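
For example, switching to one of those profiles is just a relabel of the node (a sketch; the node name is a placeholder):

kubectl label node <node-name> nvidia.com/mig.config=all-1g.20gb --overwrite
# or, if compute slices matter more than memory:
kubectl label node <node-name> nvidia.com/mig.config=all-3g.40gb --overwrite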

Did the mig-manager logs show something to this effect?

@shivamerla do we call out this caveat in the docs?

qasmi commented 8 months ago

Thank you for your feedback. I don't have any errors on the mig-manager side; the logs show "Successfuly updated to MIG config: all-1g.5gb", and the labels also reflect it:


  "nvidia.com/mig.config": "all-1g.5gb",
  "nvidia.com/mig.config.state": "success",
elezar commented 8 months ago

Could you provide the full mig-manager logs?

elezar commented 8 months ago

Note that the all-1g.5gb profile is also not valid for an 80GB A100.

qasmi commented 8 months ago

@elezar The all-1g.20gb profile resolved my issue, and I'm grateful for your assistance. However, it's worth noting that the mig-manager didn't detect the incorrect configuration. I also observed that when applying the MIG configuration for the first time, mig-manager encountered challenges in enabling MIG mode:

...
time="2023-10-27T14:55:57Z" level=debug msg="At least one mode change pending"
time="2023-10-27T14:55:57Z" level=debug msg="Resetting all GPUs..."
time="2023-10-27T14:55:57Z" level=error msg="\nResetting GPU 00000000:00:05.0 is not supported.\n" 
...

I had to manually reboot my host to resolve the issue. Is this behavior expected or normal?

elezar commented 8 months ago

Rebooting the machine to toggle the MIG mode is expected on certain systems. We explicitly disable automatic reboots in the context of the operator, meaning that the cluster admin has to reboot the node themselves.
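
If it's unclear whether a reboot is still needed, the pending mode change can be checked directly with nvidia-smi (a sketch; assumes the mig.mode query fields are available in your driver version):

nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv
# a node reboot (or GPU reset, where supported) is needed as long as current and pending differ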

Are there no additional logs once the mode change has been applied (i.e. the system has rebooted)?

shivamerla commented 8 months ago

@qasmi please refer to this file for the supported profiles for each GPU type. Ideally, mig-manager should have errored out on the invalid configuration instead of marking it as success; we will look into that. Regarding the GPU reset, that is applicable only in certain hypervisor environments, which is noted here. It looks like you are using a kind cluster in this case, so it should not be applicable. Did it fail intermittently with this error?
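
For illustration, the 80GB A100 entries in that file look roughly like the following (a sketch based on the profiles that worked in this thread; check the referenced file for the exact supported profiles and counts):

      all-1g.10gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

      all-1g.20gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.20gb": 4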

qasmi commented 8 months ago

@elezar the status changed from failed to success after the reboot, as expected:

Applying the MIG mode change from the selected config to the node (and double checking it took effect)
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2023-10-27T14:59:58Z" level=debug msg="Parsing config file..."
time="2023-10-27T14:59:58Z" level=debug msg="Selecting specific MIG config..."
time="2023-10-27T14:59:58Z" level=debug msg="Running apply-start hook"
time="2023-10-27T14:59:58Z" level=debug msg="Checking current MIG mode..."
time="2023-10-27T14:59:58Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2023-10-27T14:59:58Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-10-27T14:59:58Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2023-10-27T14:59:58Z" level=debug msg="    MIG capable: true\n"
time="2023-10-27T14:59:58Z" level=debug msg="    Current MIG mode: Enabled"
time="2023-10-27T14:59:58Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
time="2023-10-27T14:59:58Z" level=debug msg="Parsing config file..."
time="2023-10-27T14:59:58Z" level=debug msg="Selecting specific MIG config..."
time="2023-10-27T14:59:58Z" level=debug msg="Asserting MIG mode configuration..."
time="2023-10-27T14:59:58Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2023-10-27T14:59:58Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-10-27T14:59:58Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2023-10-27T14:59:58Z" level=debug msg="    MIG capable: true\n"
time="2023-10-27T14:59:58Z" level=debug msg="    Current MIG mode: Enabled"
Selected MIG mode settings from configuration currently applied
Applying the selected MIG config to the node
time="2023-10-27T14:59:58Z" level=debug msg="Parsing config file..."
time="2023-10-27T14:59:58Z" level=debug msg="Selecting specific MIG config..."
time="2023-10-27T14:59:58Z" level=debug msg="Running apply-start hook"
time="2023-10-27T14:59:58Z" level=debug msg="Checking current MIG mode..."
time="2023-10-27T14:59:59Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2023-10-27T14:59:59Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-10-27T14:59:59Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2023-10-27T14:59:59Z" level=debug msg="    MIG capable: true\n"
time="2023-10-27T14:59:59Z" level=debug msg="    Current MIG mode: Enabled"
time="2023-10-27T14:59:59Z" level=debug msg="Checking current MIG device configuration..."
time="2023-10-27T14:59:59Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2023-10-27T14:59:59Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-10-27T14:59:59Z" level=debug msg="    Asserting MIG config: map[1g.20gb:4]"
time="2023-10-27T14:59:59Z" level=debug msg="Running pre-apply-config hook"
time="2023-10-27T14:59:59Z" level=debug msg="Applying MIG device configuration..."
time="2023-10-27T14:59:59Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2023-10-27T14:59:59Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-10-27T14:59:59Z" level=debug msg="    MIG capable: true\n"
time="2023-10-27T14:59:59Z" level=debug msg="    Updating MIG config: map[1g.20gb:4]"
MIG configuration applied successfully
time="2023-10-27T14:59:59Z" level=debug msg="Running apply-exit hook"
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-9hc2b" deleted
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/nvidia-mig-control-plane labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/nvidia-mig-control-plane labeled
time="2023-10-27T15:00:00Z" level=info msg="Successfuly updated to MIG config: all-1g.20gb"
time="2023-10-27T15:00:00Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

I lost the previous logs with the error after rebooting my VM.

@shivamerla I'm using Kind on top of an OpenStack VM.

shivamerla commented 8 months ago

@qasmi glad to know it's working with the correct config now; we will look into propagating an error in the case of invalid configs.

senshreyank commented 3 months ago

Hi, I have 2 NVIDIA A100-80GB cards in one of my k8s worker nodes, with my custom MIG profile added to the default-mig-parted-config ConfigMap under the gpu-operator-resources namespace:

      custom-mig:
        - devices: [1]
          mig-enabled: false
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7

On applying it with the command: kubectl label nodes gpu2 nvidia.com/mig.config=custom-mig --overwrite

the logs from the nvidia-mig-manager-dzwr9 pod are:

time="2024-04-22T12:47:15Z" level=info msg="Updating to MIG config: custom-mig"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
Current value of 'nvidia.com/gpu.deploy.nvsm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=success'
Checking if the selected MIG config is currently applied or not
time="2024-04-22T12:47:16Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2024-04-22T12:47:16Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/gpu2 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/gpu2 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-d4lff condition met
Waiting for gpu-feature-discovery to shutdown
pod/gpu-feature-discovery-gr59k condition met
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Removing the cuda-validator pod
pod "nvidia-cuda-validator-42h2v" deleted
Removing the plugin-validator pod
No resources found
Applying the MIG mode change from the selected config to the node (and double checking it took effect)
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2024-04-22T12:47:18Z" level=debug msg="Parsing config file..."
time="2024-04-22T12:47:18Z" level=debug msg="Selecting specific MIG config..."
time="2024-04-22T12:47:18Z" level=debug msg="Running apply-start hook"
time="2024-04-22T12:47:18Z" level=debug msg="Checking current MIG mode..."
time="2024-04-22T12:47:18Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:18Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:18Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2024-04-22T12:47:18Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:18Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-04-22T12:47:18Z" level=debug msg="Running pre-apply-mode hook"
time="2024-04-22T12:47:18Z" level=debug msg="Applying MIG mode change..."
time="2024-04-22T12:47:18Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:18Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:18Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:18Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-04-22T12:47:18Z" level=debug msg="    Clearing existing MIG configuration"
time="2024-04-22T12:47:18Z" level=debug msg="    Updating MIG mode: Disabled"
time="2024-04-22T12:47:21Z" level=debug msg="    Mode change pending: false"
time="2024-04-22T12:47:21Z" level=debug msg="Walking MigConfig for (devices=[0])"
time="2024-04-22T12:47:21Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2024-04-22T12:47:21Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:21Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-04-22T12:47:21Z" level=debug msg="    Clearing existing MIG configuration"
time="2024-04-22T12:47:22Z" level=debug msg="    Updating MIG mode: Enabled"
time="2024-04-22T12:47:22Z" level=debug msg="    Mode change pending: false"
time="2024-04-22T12:47:22Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
time="2024-04-22T12:47:22Z" level=debug msg="Parsing config file..."
time="2024-04-22T12:47:22Z" level=debug msg="Selecting specific MIG config..."
time="2024-04-22T12:47:22Z" level=debug msg="Asserting MIG mode configuration..."
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Current MIG mode: Disabled"
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[0])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Current MIG mode: Enabled"
Selected MIG mode settings from configuration currently applied
Applying the selected MIG config to the node
time="2024-04-22T12:47:22Z" level=debug msg="Parsing config file..."
time="2024-04-22T12:47:22Z" level=debug msg="Selecting specific MIG config..."
time="2024-04-22T12:47:22Z" level=debug msg="Running apply-start hook"
time="2024-04-22T12:47:22Z" level=debug msg="Checking current MIG mode..."
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Current MIG mode: Disabled"
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[0])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    Asserting MIG mode: Enabled"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Current MIG mode: Enabled"
time="2024-04-22T12:47:22Z" level=debug msg="Checking current MIG device configuration..."
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[0])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    Asserting MIG config: map[1g.10gb:7]"
time="2024-04-22T12:47:22Z" level=debug msg="Running pre-apply-config hook"
time="2024-04-22T12:47:22Z" level=debug msg="Applying MIG device configuration..."
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[1])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 1: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Skipping MIG config -- MIG disabled"
time="2024-04-22T12:47:22Z" level=debug msg="Walking MigConfig for (devices=[0])"
time="2024-04-22T12:47:22Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2024-04-22T12:47:22Z" level=debug msg="    MIG capable: true\n"
time="2024-04-22T12:47:22Z" level=debug msg="    Updating MIG config: map[1g.10gb:7]"
time="2024-04-22T12:47:23Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-g75kv" deleted
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/gpu2 labeled
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/gpu2 labeled
time="2024-04-22T12:47:55Z" level=info msg="Successfuly updated to MIG config: custom-mig"
time="2024-04-22T12:47:55Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

and the logs from the nvidia-device-plugin-daemonset-988rz pod are:

k logs -f nvidia-device-plugin-daemonset-phq9d -n gpu-operator-resources
Defaulted container "nvidia-device-plugin" out of: nvidia-device-plugin, toolkit-validation (init)
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
CONTAINER_DRIVER_ROOT=/run/nvidia/driver
Starting nvidia-device-plugin
I0422 12:59:05.155987       1 main.go:154] Starting FS watcher.
I0422 12:59:05.156055       1 main.go:161] Starting OS watcher.
I0422 12:59:05.156292       1 main.go:176] Starting Plugins.
I0422 12:59:05.156304       1 main.go:234] Loading configuration.
I0422 12:59:05.156370       1 main.go:242] Updating config with default resource matching patterns.
I0422 12:59:05.156492       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/run/nvidia/driver",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/run/nvidia/driver"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ],
    "mig": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0422 12:59:05.156498       1 main.go:256] Retreiving plugins.
I0422 12:59:05.156829       1 factory.go:107] Detected NVML platform: found NVML library
I0422 12:59:05.156851       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0422 12:59:05.725565       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: all devices on the node must be configured with the same migEnabled value

So isn't it possible to configure two different custom MIG settings on different attached devices? If yes, please let me know the solution.
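
For what it's worth, the error seems tied to the single MIG strategy visible in the device-plugin config above, which requires every GPU on the node to carry the same mig-enabled value. A profile along these lines would satisfy that constraint (a sketch only, not a verified fix, and it MIG-enables both GPUs):

      custom-mig:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7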

Thanks in advance.

elezar commented 3 months ago

@senshreyank could you please open a NEW issue for your problem? Please also provide the versions of the components that you are using.