Closed: Kaka1127 closed this issue 3 years ago.
@Kaka1127 can you provide a bit more detail on this? Did you try to update the mig-parted-config ConfigMap? Or, after adding the nvidia.com/mig.config label to the node, was the config not applied? Can you attach the mig-manager pod logs as well?
@shivamerla
Sure.
> Did you try to update mig-parted-config ConfigMap? or after adding label to node nvidia.com/mig.config the config was not applied?
Yes. I modified the ConfigMap as shown below. The edit succeeded, but the custom configuration was not present in the mig-parted-config file after it was applied.
```console
$ kubectl edit configmap mig-parted-config -n gpu-operator-resources
```

```yaml
## Add configuration
optimize:
  - devices: [0]
    mig-enabled: true
    mig-devices:
      "1g.5gb": 2
      "2g.10gb": 1
      "3g.20gb": 1
  - devices: [1]
    mig-enabled: false
```
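As background on why a profile like this is plausible: an A100-40GB exposes 7 compute slices, and the 1g.5gb, 2g.10gb, and 3g.20gb profiles consume 1, 2, and 3 slices respectively (per NVIDIA's MIG documentation). A minimal sketch of a slice-count sanity check; note this sum check is necessary but not sufficient, since real MIG placement also has geometry constraints:

```python
# Compute slices consumed by each MIG profile on an A100-40GB (7 slices total).
# Slice widths taken from NVIDIA's MIG documentation.
SLICES = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "7g.40gb": 7}

def profile_fits(mig_devices, capacity=7):
    """Return True if the requested mig-devices fit within the GPU's compute slices."""
    used = sum(SLICES[p] * count for p, count in mig_devices.items())
    return used <= capacity

# The "optimize" profile above for device 0: 2*1 + 1*2 + 1*3 = 7 slices, so it fits.
print(profile_fits({"1g.5gb": 2, "2g.10gb": 1, "3g.20gb": 1}))  # True
```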
> Can you attach mig-manager pod logs as well?
Sure. Please see the attached log file.
Best regards, Kaka
Ah, we would need to provide an option for users to specify a custom mig-parted config, as the operator reconciles the default ConfigMap back to the original. Will look into it.
@shivamerla Thanks! I am waiting for your update.
Any update?
@Kaka1127 we are planning to release v1.8.2 later this month after QA completion. Meanwhile, you can install using a private image if required.
Clone the GPU Operator repo and follow the steps below:
```console
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
$ helm upgrade gpu-operator deployments/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set validator.version=v1.8.1 --set nodeStatusExporter.version=v1.8.1 --set migManager.config.name=<migparted-configmap-name>
```
Here, provide the name of the ConfigMap you created in the gpu-operator-resources namespace.
@shivamerla Thank you for your update! I updated following your steps, but I hit an issue. It seems that the mig-manager still referred to the default ConfigMap... I checked the ClusterPolicy, and it did not show the migManager.config setting (see below), even though I applied the latest one.
```console
$ kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/deployments/gpu-operator/crds/nvidia.com_clusterpolicies_crd.yaml
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.nvidia.com configured
$ helm upgrade gpu-operator nvidia/gpu-operator --set operator.repository=quay.io/shivamerla --set operator.version=v1.8.2 --set migManager.config.name=custom-mig-parted-config --set mig.strategy=mixed
$ kubectl edit clusterpolicy
```

```yaml
migManager:
  enabled: true
  env:
    - name: WITH_REBOOT
      value: "false"
  image: k8s-mig-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  securityContext:
    privileged: true
  version: v0.1.2-ubuntu20.04
```
I see that the change to apply this to the ClusterPolicy via helm templates exists here. Will double-check tomorrow. Meanwhile, you can edit the ClusterPolicy and add this entry manually.
Hmm... I added config.name under the migManager section as below, but "nvidia.com/mig.config.state" showed "failed" when I applied my custom config file, even though "nvidia.com/mig.config" was set to a default mig profile such as all-disabled.
```console
$ kubectl edit clusterpolicy
```

```yaml
migManager:
  config:
    name: custom-mig-parted-config
  enabled: true
  env:
    - name: WITH_REBOOT
      value: "false"
  image: k8s-mig-manager
  imagePullPolicy: IfNotPresent
  repository: nvcr.io/nvidia/cloud-native
  securityContext:
    privileged: true
```
```console
$ kubectl get -n gpu-operator-resources configmap
NAME                        DATA   AGE
custom-mig-parted-config    1      44m
default-mig-parted-config   1      41m
kube-root-ca.crt            1      22d
mig-parted-config           1      22d
```
```console
$ kubectl describe -n gpu-operator-resources configmap custom-mig-parted-config
Name:         custom-mig-parted-config
Namespace:    gpu-operator-resources
Labels:       <none>
Annotations:  <none>

Data
====
config.yaml:
----
version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false

  # A100-40GB
  all-1g.5gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.5gb": 7

  all-2g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.10gb": 3

  all-3g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.20gb": 2

  all-7g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "7g.40gb": 1

  optimize:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        "1g.5gb": 2
        "2g.10gb": 1
        "3g.20gb": 1
    - devices: [1]
      mig-enabled: false

Events:  <none>
```
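As an aside, the per-profile counts in the default config above (7x 1g.5gb, 3x 2g.10gb, 2x 3g.20gb, 1x 7g.40gb) follow directly from dividing the A100-40GB's 7 compute slices by each profile's slice width. A quick sketch, with the slice widths taken from NVIDIA's MIG documentation:

```python
# Max homogeneous instances per profile on an A100-40GB = 7 // slice width.
SLICE_WIDTH = {"1g.5gb": 1, "2g.10gb": 2, "3g.20gb": 3, "7g.40gb": 7}
TOTAL_SLICES = 7

for profile, width in SLICE_WIDTH.items():
    # Matches the counts used by the all-<profile> entries in the config above.
    print(f"all-{profile}: {TOTAL_SLICES // width}x {profile}")
```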
Sorry, I found the issue in my custom mig config file... the indentation was shifted.
@Kaka1127 does this work as expected now?
@shivamerla Yes, it worked as I expected.
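Since the root cause above was shifted indentation in the custom config, this class of mistake can be caught with a small structural check before applying the ConfigMap. A minimal sketch, assuming the YAML has already been parsed into a dict (e.g. with PyYAML, not shown); the required keys are taken from the config format used throughout this thread:

```python
# Validate the structure of a parsed mig-parted config dict.
# A shifted indent typically turns a sibling key into a stray key at the
# wrong level, which shows up here as a malformed or incomplete entry.
def validate_mig_config(cfg):
    errors = []
    if cfg.get("version") != "v1":
        errors.append("missing or unexpected 'version'")
    for name, entries in cfg.get("mig-configs", {}).items():
        if not isinstance(entries, list):
            errors.append(f"{name}: must be a list of device entries")
            continue
        for entry in entries:
            if "devices" not in entry or "mig-enabled" not in entry:
                errors.append(f"{name}: entry missing 'devices' or 'mig-enabled'")
            if entry.get("mig-enabled") and "mig-devices" not in entry:
                errors.append(f"{name}: mig-enabled entry missing 'mig-devices'")
    return errors

good = {"version": "v1", "mig-configs": {"optimize": [
    {"devices": [0], "mig-enabled": True, "mig-devices": {"1g.5gb": 2}},
    {"devices": [1], "mig-enabled": False}]}}
print(validate_mig_config(good))  # []
```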
1. Quick Debug Checklist
- Are the i2c_core and ipmi_msghandler kernel modules loaded on the nodes? -> Only "ipmi_msghandler"
- kubectl describe clusterpolicies --all-namespaces
1. Issue or feature description
The config reverts to the default config file after applying the mig config (i.e. after changing the value of "nvidia.com/mig.config"). Note: this issue did not occur with the previous version (v1.7.0).
2. Steps to reproduce the issue
Simply deploy the latest GPU Operator.
3. Information to attach (optional if deemed irrelevant)
- kubectl get pods --all-namespaces
- kubectl get ds --all-namespaces
- docker run -it alpine echo foo
- cat /etc/docker/daemon.json
- docker info | grep Runtime
- ls -la /run/nvidia
- ls -la /usr/local/nvidia/toolkit
- ls -la /run/nvidia/driver
- journalctl -u kubelet > kubelet.logs