@d-m This is one of the features that will be part of the 1.7.0 release. It's still in review: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/207
Is it possible for you to validate with a private build on your system?
Thanks @shivamerla! Yes I can do that. I'll try today or tomorrow.
Thanks @d-m. Please make sure to delete the old clusterpolicies CRD and the gpu-operator clusterroles/bindings before you deploy this, in case they are still lying around.
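For example, cleanup along these lines should clear out the stale objects (the exact resource names are assumptions and may differ on your cluster):
$ kubectl delete crd clusterpolicies.nvidia.com
$ kubectl delete clusterrole gpu-operator
$ kubectl delete clusterrolebinding gpu-operator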
I tried installing the operator from your branch specified in the MR and received the following error:
$ kubectl logs -n kube-system gpu-operator-7d8ffb476f-b8b8x
unknown flag: --leader-elect
Usage of gpu-operator:
--zap-devel Enable zap development mode (changes defaults to console encoder, debug log level, disables sampling and stacktrace from 'warning' level)
--zap-encoder encoder Zap log encoding ('json' or 'console')
--zap-level level Zap log level (one of 'debug', 'info', 'error' or any integer value > 0) (default info)
--zap-sample sample Enable zap log sampling. Sampling will be disabled for integer log levels > 1
--zap-stacktrace-level level Set the minimum log level that triggers stacktrace generation (default error)
--zap-time-encoding timeEncoding Sets the zap time format ('epoch', 'millis', 'nano', or 'iso8601') (default )
unknown flag: --leader-elect
When I deleted the --leader-elect flag from the template, I got a new error:
$ kubectl logs -n kube-system gpu-operator-665fdc747f-497q6
{"level":"info","ts":1620251700.8265946,"logger":"cmd","msg":"Go Version: go1.13.15"}
{"level":"info","ts":1620251700.8266578,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1620251700.8266659,"logger":"cmd","msg":"Version of operator-sdk: v0.17.0"}
{"level":"info","ts":1620251700.8269262,"logger":"leader","msg":"Trying to become the leader."}
{"level":"error","ts":1620251702.480583,"logger":"cmd","msg":"","error":"required env POD_NAME not set, please configure downward API","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nmain.main\n\t/go/src/github.com/NVIDIA/gpu-operator/cmd/manager/main.go:69\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
I verified that the downward API was configured in the deployed pod security policy; however, it looks like there are some other changes compared to the 1.6.2 version of the helm chart, which has POD_NAME defined.
Is there a development version of the image that goes along with these changes? It's still specified as 1.6.2 and I didn't see anything newer at https://ngc.nvidia.com/catalog/containers/nvidia:gpu-operator/tags.
You would need to build a private image from the master branch, or you can use the quay.io/shivamerla/gpu-operator:psp image.
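If you go with that image, the chart should let you point at it at install time. Assuming the usual operator image values (operator.repository, operator.image, operator.version — these value paths and the chart path are not verified against this chart version), something like:
$ helm install gpu-operator ./deployments/gpu-operator \
    --set operator.repository=quay.io/shivamerla \
    --set operator.image=gpu-operator \
    --set operator.version=psp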
@shivamerla I'll try that today.
Looks like the new helm chart deploys successfully once I used the updated image.
However, now I'm running into the following error with the nvidia-device-plugin-daemonset:
Warning FailedCreatePodSandBox 67s (x49 over 11m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
This might be unrelated, so I'll double check the documentation to make sure I didn't miss something.
@d-m Thanks for checking. Please make sure the right runtime is passed during install with --set operator.defaultRuntime= set to either docker or containerd.
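For example, on a containerd-based cluster the install would look roughly like this (release name and chart reference are placeholders):
$ helm install gpu-operator nvidia/gpu-operator \
    --set operator.defaultRuntime=containerd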
I have it set to containerd and the toolkit seems to complete successfully:
time="2021-05-06T14:18:32Z" level=info msg="Starting 'setup' for containerd"
time="2021-05-06T14:18:32Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2021-05-06T14:18:32Z" level=info msg="Successfully parsed arguments"
time="2021-05-06T14:18:32Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
time="2021-05-06T14:18:32Z" level=info msg="Config file does not exist, creating new one"
time="2021-05-06T14:18:32Z" level=info msg="Successfully loaded config"
time="2021-05-06T14:18:32Z" level=info msg="Containerd version is v1.4.4"
time="2021-05-06T14:18:32Z" level=info msg="Config version: 2"
time="2021-05-06T14:18:32Z" level=info msg="Updating config"
time="2021-05-06T14:18:32Z" level=info msg="Successfully updated config"
time="2021-05-06T14:18:32Z" level=info msg="Flushing config"
time="2021-05-06T14:18:32Z" level=info msg="Successfully flushed config"
time="2021-05-06T14:18:32Z" level=info msg="Sending SIGHUP signal to containerd"
time="2021-05-06T14:18:32Z" level=info msg="Successfully signaled containerd"
time="2021-05-06T14:18:32Z" level=info msg="Completed 'setup' for containerd"
time="2021-05-06T14:18:32Z" level=info msg="Waiting for signal"
Do you see the nvidia runtimeClass object created in the gpu-operator-resources namespace and the [plugins.cri.containerd.runtimes.nvidia] stanza set in /etc/containerd/config.toml?
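A quick way to verify both, assuming shell access to the node (the grep pattern is only illustrative):
$ kubectl get runtimeclass nvidia
$ grep -A 5 'runtimes.nvidia' /etc/containerd/config.toml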
Yep! However, we deploy the cluster with kops and it looks like kops uses /etc/containerd/config-kops.toml for its containerd configuration. I copied the configuration that the container-toolkit container put in config.toml to config-kops.toml, reloaded the containerd config, and the device-plugin container ran successfully.
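Roughly what that amounted to on each node (the restart command is an assumption, adjust for your setup):
$ # merge the [plugins.cri.containerd.runtimes.nvidia] section from
$ # /etc/containerd/config.toml into /etc/containerd/config-kops.toml, then:
$ sudo systemctl restart containerd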
Is it possible to override the containerd config location via the helm chart?
Yes, you can pass --set toolkit.env[0].name=CONTAINERD_CONFIG --set toolkit.env[0].value="path"
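Combined with the kops path above, that would look roughly like this (release name and chart reference are placeholders):
$ helm install gpu-operator nvidia/gpu-operator \
    --set operator.defaultRuntime=containerd \
    --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
    --set 'toolkit.env[0].value=/etc/containerd/config-kops.toml'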
Just found that in the codebase as you commented. I'll give it a shot.
That did the trick, thanks for your help!
Does anyone know what the toml file config is for K3s?
After deploying the gpu-operator Helm chart on a cluster with pod security policies enabled, adding a GPU instance to the cluster results in the following events:
Adding a PodSecurityPolicy with these permissions and an associated Role and RoleBinding for the nvidia-driver service account fixes the issue.
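The exact PSP permissions are not reproduced above, but purely as an illustration of the wiring (all object names below are assumed), granting the nvidia-driver service account use of a PSP looks something like this:
$ cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nvidia-driver-psp            # assumed name
  namespace: gpu-operator-resources
rules:
- apiGroups: ["policy"]
  resources: ["podsecuritypolicies"]
  verbs: ["use"]
  resourceNames: ["nvidia-driver"]   # assumed PSP name
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nvidia-driver-psp
  namespace: gpu-operator-resources
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nvidia-driver-psp
subjects:
- kind: ServiceAccount
  name: nvidia-driver
  namespace: gpu-operator-resources
EOF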