NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Back to the default config file after applying the MIG config #249

Closed: Kaka1127 closed this issue 3 years ago

Kaka1127 commented 3 years ago


1. Issue or feature description

The ConfigMap reverts to the default config file after applying the MIG config (i.e. after changing the value of the nvidia.com/mig.config node label). Note: this issue did not occur with the previous version (v1.7.0).
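
For context, a MIG profile from this config is selected by labeling the node; a sketch, with <node-name> as a placeholder:

$ kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite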

2. Steps to reproduce the issue

Simply deploy the latest GPU Operator.


shivamerla commented 3 years ago

@Kaka1127 can you provide a bit more detail on this? Did you try to update the mig-parted-config ConfigMap? Or was the config not applied after adding the nvidia.com/mig.config label to the node? Can you attach the mig-manager pod logs as well?

Kaka1127 commented 3 years ago

@shivamerla

Sure.

Did you try to update the mig-parted-config ConfigMap? Or was the config not applied after adding the nvidia.com/mig.config label to the node?

Yes. I modified the ConfigMap as shown below. The edit itself succeeded, but the custom configuration was no longer present in the mig-parted-config file afterwards.

$ kubectl edit configmap mig-parted-config -n gpu-operator-resources
## Add configuration
      optimize:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.5gb": 2
            "2g.10gb": 1
            "3g.20gb": 1
       - devices: [1]
         mig-enabled: false
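
A quick way to confirm the revert is to re-read the ConfigMap once the operator has reconciled, for example:

$ kubectl get configmap mig-parted-config -n gpu-operator-resources -o yaml | grep -A 6 optimize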

Can you attach the mig-manager pod logs as well?

Sure. Please see the attached log file.

container.log
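
For anyone reproducing this, the same logs can be pulled directly; a sketch, assuming the operator's default app=nvidia-mig-manager pod label:

$ kubectl logs -n gpu-operator-resources -l app=nvidia-mig-manager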

Best regards. Kaka

shivamerla commented 3 years ago

Ah, we would need to provide an option for the user to specify a custom mig-parted-config, since the operator reconciles the default one back to the original. Will look into it.

Kaka1127 commented 3 years ago

@shivamerla Thanks! I'll be waiting for your update.

Kaka1127 commented 3 years ago

Any update?

shivamerla commented 3 years ago

@Kaka1127 we are planning to release v1.8.2 later this month, after QA completion. Meanwhile, you can install using a private image if required.

Clone the GPU Operator repo and follow the steps below:

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml

$ helm upgrade gpu-operator deployments/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set validator.version=v1.8.1 --set nodeStatusExporter.version=v1.8.1 --set migManager.config.name=<migparted-configmap-name>

For migManager.config.name, provide the name of the ConfigMap you created in the gpu-operator-resources namespace.
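
A sketch of that step, assuming the custom profiles live in a local custom-config.yaml (both names here are placeholders); the data must sit under a config.yaml key, matching the ConfigMap layout shown later in this thread:

$ kubectl create configmap custom-mig-parted-config -n gpu-operator-resources --from-file=config.yaml=custom-config.yaml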

Kaka1127 commented 3 years ago

@shivamerla Thank you for the update! I followed your steps, but I ran into an issue. It seems that the mig-manager still refers to the default ConfigMap... I checked the ClusterPolicy, and it does not show the migManager.config setting (see below), even though I applied the latest version.

$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
$ helm upgrade gpu-operator nvidia/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set migManager.config.name=custom-mig-parted-config --set mig.strategy=mixed
$ kubectl edit clusterpolicy
  migManager:
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    securityContext:
      privileged: true
    version: v0.1.2-ubuntu20.04

shivamerla commented 3 years ago

I see that the change to apply this to the ClusterPolicy via the Helm templates exists here. Will double-check tomorrow. Meanwhile, you can edit the ClusterPolicy and add this entry manually.
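
A sketch of the manual edit as a one-liner, assuming the ClusterPolicy instance is named cluster-policy (the name the chart creates by default):

$ kubectl patch clusterpolicy cluster-policy --type merge -p '{"spec": {"migManager": {"config": {"name": "custom-mig-parted-config"}}}}'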

Kaka1127 commented 3 years ago

Hmm... I added config.name to the migManager section as shown below, but nvidia.com/mig.config.state reports a "failed" status when I use my custom config file, even when nvidia.com/mig.config is set to a default MIG profile such as all-disabled.

$ kubectl edit clusterpolicy
  migManager:
    config:
      name: custom-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: "false"
    image: k8s-mig-manager
    imagePullPolicy: IfNotPresent
    repository: nvcr.io/nvidia/cloud-native
    securityContext:
      privileged: true
$ kubectl get -n gpu-operator-resources configmap
NAME                        DATA   AGE
custom-mig-parted-config    1      44m
default-mig-parted-config   1      41m
kube-root-ca.crt            1      22d
mig-parted-config           1      22d
$ kubectl describe -n gpu-operator-resources configmap custom-mig-parted-config
Name:         custom-mig-parted-config
Namespace:    gpu-operator-resources
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false
  # A100-40GB
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7
  all-2g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.10gb": 3
  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2
  all-7g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "7g.40gb": 1
  optimize:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
   - devices: [1]
     mig-enabled: false

Events:  <none>
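
For anyone following along, the per-node state can be read back from the label; a sketch, with <node-name> as a placeholder:

$ kubectl get node <node-name> -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
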
Kaka1127 commented 3 years ago

Sorry, I found the issue in my custom MIG config file: the indentation was shifted.
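
For reference, the corrected optimize entry, with both list items aligned at the same indentation:

  optimize:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
    - devices: [1]
      mig-enabled: false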

shivamerla commented 3 years ago

@Kaka1127 does this work as expected now?

Kaka1127 commented 3 years ago

@shivamerla Yes, it worked as I expected.