weistonedawei opened 1 month ago
@weistonedawei for comparison, do you have the behaviour for a similar setup that isn't using the nvidia-container-runtime?
Also, what are the contents of /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml?
@elezar I updated my previous comment to include the config.toml content, which is installed by the gpu-operator. The container engine is containerd.
Another installation, which uses Docker as the container engine and has nvidia-container-runtime installed from the NVIDIA apt package repository on the worker node host, also shows excessive logging, but without the "Using config ..." entries. The version of nvidia-container-runtime there is 1.11.0. The '/run' utilization grows more slowly, but it is definitely filling up the '/run' tmpfs.
NVIDIA Container Runtime version 1.11.0
commit: d9de4a0
spec: 1.0.2-dev
On installations without nvidia-container-runtime, '/run' tmpfs mount utilization is below 1%. containerd is the engine.
Easily reproducible:
1) install the gpu-operator in a Kubernetes cluster
2) create a pod that uses an exec livenessProbe
3) log in to the node on which the pod with the exec livenessProbe is running and run df -h /run
Thanks for checking it out.
@weistonedawei I think we can reduce the info logging in cases where we are not creating a container. I would propose Debug or Trace: logging this at Debug if a create command is issued and at Trace if this is not the case. Note that this will only be available as an update to the 1.15.x version of the toolkit.
Observed a Kubernetes workload deployment failure caused by excessive logging to the /run/containerd/io.containerd.runtime.v2.task/k8s.io//log.json file. This drives the /run tmpfs mount to 100% utilization, which prevents further container creation on the affected node.
When a container spec uses an exec livenessProbe, the following log entries are logged.
A sample container spec:
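For illustration only (this is not the original spec; the pod name, image, and probe command below are placeholders), a minimal pod spec with an exec livenessProbe looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec-demo              # placeholder name
spec:
  containers:
  - name: app
    image: registry.k8s.io/busybox      # placeholder image
    command: ["sh", "-c", "touch /tmp/healthy && sleep 3600"]
    livenessProbe:
      exec:
        command: ["cat", "/tmp/healthy"]
      initialDelaySeconds: 5
      # every probe run execs into the container via the configured runtime,
      # so each period produces new runtime log entries in log.json
      periodSeconds: 5
```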
The log entries come from runtime.go, starting at line 75, and from the runtime_low_level.go code.
IMHO, setting the log level to DEBUG should be fine; it would allow easy debugging without affecting functionality.
The current workaround is to set log-level = "error" in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml.
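A minimal sketch of that workaround, assuming the log-level key sits in the [nvidia-container-runtime] section as in the toolkit's default config layout:

```toml
# /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
[nvidia-container-runtime]
log-level = "error"   # only log errors, so per-invocation info entries no longer fill /run
```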
I used the gpu-operator in the Kubernetes cluster, and here is the runtime version info:
Attempting to create /etc/nvidia-container-runtime/config.toml to override the log-level did not work.