NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Back to the default config file after applying the MIG config #249

Closed: Kaka1127 closed this issue 3 years ago

Kaka1127 commented 3 years ago


1. Issue or feature description

The ConfigMap reverts to the default config file after applying the MIG config (i.e. after changing the value of the nvidia.com/mig.config node label). Note: this issue did not occur with the previous version (v1.7.0).
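
For context, a MIG profile from this config is selected by labeling the node; a sketch, with <node-name> as a placeholder:

$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite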

2. Steps to reproduce the issue

Simply deploy the latest GPU Operator.


shivamerla commented 3 years ago

@Kaka1127 can you provide a bit more detail on this? Did you try to update the mig-parted-config ConfigMap? Or was the config not applied after adding the nvidia.com/mig.config label to the node? Can you attach the mig-manager pod logs as well?

Kaka1127 commented 3 years ago

@shivamerla

Sure.

Did you try to update the mig-parted-config ConfigMap? Or was the config not applied after adding the nvidia.com/mig.config label to the node?

Yes. I modified the ConfigMap as shown below. The edit itself succeeded, but the custom configuration was no longer present in the mig-parted-config file afterwards.

$ kubectl edit configmap mig-parted-config -n gpu-operator-resources
## Add configuration
      optimize:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.5gb": 2
            "2g.10gb": 1
            "3g.20gb": 1
       - devices: [1]
         mig-enabled: false
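
A quick way to confirm the revert is to re-read the ConfigMap once the operator has reconciled, for example:

$ kubectl get configmap mig-parted-config -n gpu-operator-resources -o yaml | grep -A 6 optimize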

Can you attach the mig-manager pod logs as well?

Sure. Please see the attached log file.

container.log
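
For anyone reproducing this, the same logs can be pulled directly; a sketch, assuming the operator's default app=nvidia-mig-manager pod label:

$ kubectl logs -n gpu-operator-resources -l app=nvidia-mig-manager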

Best regards. Kaka

shivamerla commented 3 years ago

Ah, we would need to provide an option for the user to specify a custom mig-parted-config, since the operator reconciles the default one back to the original. Will look into it.

Kaka1127 commented 3 years ago

@shivamerla Thanks! I'll be waiting for your update.

Kaka1127 commented 3 years ago

Any update?

shivamerla commented 3 years ago

@Kaka1127 we are planning to release v1.8.2 later this month, after QA completion. Meanwhile, you can install using a private image if required.

Clone the GPU Operator repo and follow the steps below:

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml

$ helm upgrade gpu-operator deployments/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set validator.version=v1.8.1 --set nodeStatusExporter.version=v1.8.1 --set migManager.config.name=<migparted-configmap-name>

For migManager.config.name, provide the name of the ConfigMap you created in the gpu-operator-resources namespace.
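
A sketch of that step, assuming the custom profiles live in a local custom-config.yaml (both names here are placeholders); the data must sit under a config.yaml key, matching the ConfigMap layout shown later in this thread:

$ kubectl create configmap custom-mig-parted-config -n gpu-operator-resources --from-file=config.yaml=custom-config.yaml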

Kaka1127 commented 3 years ago

@shivamerla Thank you for the update! I followed your steps, but I ran into an issue. It seems that the mig-manager still refers to the default ConfigMap... I checked the ClusterPolicy, and it does not show the migManager.config setting (see below), even though I applied the latest version.

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
$ helm upgrade gpu-operator nvidia/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set migManager.config.name=custom-mig-parted-config --set mig.strategy=mixed
$ kubectl edit clusterpolicy
  migManager:
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    securityContext:
      privileged: true
    version: v0.1.2-ubuntu20.04

shivamerla commented 3 years ago

I see that the change to apply this to the ClusterPolicy via the Helm templates exists here. Will double-check tomorrow. Meanwhile, you can edit the ClusterPolicy and add this entry manually.
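
A sketch of the manual edit as a one-liner, assuming the ClusterPolicy instance is named cluster-policy (the name the chart creates by default):

$ kubectl patch clusterpolicy cluster-policy --type merge -p '{"spec": {"migManager": {"config": {"name": "custom-mig-parted-config"}}}}'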

Kaka1127 commented 3 years ago

Hmm... I added config.name to the migManager section as shown below, but nvidia.com/mig.config.state reports a "failed" status when I use my custom config file, even when nvidia.com/mig.config is set to a default MIG profile such as all-disabled.

$ kubectl edit clusterpolicy
  migManager:
    config:
      name: custom-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    securityContext:
      privileged: true
$ kubectl get -n gpu-operator-resources configmap
NAME                        DATA   AGE
custom-mig-parted-config    1      44m
default-mig-parted-config   1      41m
kube-root-ca.crt            1      22d
mig-parted-config           1      22d
$ kubectl describe -n gpu-operator-resources configmap custom-mig-parted-config
Name:         custom-mig-parted-config
Namespace:    gpu-operator-resources
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  # A100-40GB
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
  all-2g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.10gb": 3
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
  all-7g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "7g.40gb": 1
  optimize:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
   - devices: [1]
     mig-enabled: false

Events:  <none>
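
For anyone following along, the per-node state can be read back from the label; a sketch, with <node-name> as a placeholder:

$ kubectl get node <node-name> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
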
Kaka1127 commented 3 years ago

Sorry, I found the issue in my custom MIG config file: the indentation was shifted.
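
For reference, the corrected optimize entry, with both list items aligned at the same indentation:

  optimize:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
    - devices: [1]
      mig-enabled: false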

shivamerla commented 3 years ago

@Kaka1127 does this work as expected now?

Kaka1127 commented 3 years ago

@shivamerla Yes, it worked as I expected.