NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

nvidia-dcgm-exporter creates huge logs inside container #182

Open boniek83 opened 3 years ago

boniek83 commented 3 years ago

Either the log's size should be limited by a configurable option, it shouldn't be created at all, or a PV/PVC should be used. Ephemeral storage ain't free :)

root@nvidia-dcgm-exporter-ck74t:/# du -skh /var/log/*
4.0K    /var/log/alternatives.log
48K     /var/log/apt
60K     /var/log/bootstrap.log
0       /var/log/btmp
184K    /var/log/dpkg.log
4.0K    /var/log/faillog
32K     /var/log/lastlog
1.4G    /var/log/nv-hostengine.log
0       /var/log/wtmp

dualvtable commented 3 years ago

hi @boniek83 - which version of dcgm-exporter are you using?

boniek83 commented 3 years ago

nvcr.io/nvidia/k8s/dcgm-exporter:2.1.4-2.2.0-ubuntu20.04. This is the version shipped in gpu-operator v1.6.2.

jfolz commented 3 years ago

I think this may be related to what we're seeing in #194. Our biggest nv-hostengine.log was something like 8+ GB.

IsQiao commented 3 years ago

same issue

treydock commented 3 years ago

Based on feedback from NVIDIA I set the following environment variable to silence the extra logging:

__DCGM_DBG_LVL=NONE

Now the only logs I get in /var/log/nv-hostengine.log are 1 or 2 messages every 30 seconds.

boniek83 commented 3 years ago

Nice, but not good enough, since it still logs something. We don't know whether the amount of data being logged will change between releases. This should be logged to stdout or to a dedicated persistent volume, or we should just have an option to disable it altogether.
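
Until such an option exists, one possible stopgap is to cap the file from inside the container. The following is only a minimal sketch, not something NVIDIA provides or recommends in this thread: `cap_log` is a hypothetical helper name, the 100 MiB limit is an arbitrary assumption, and how it would be scheduled (sidecar, cron, periodic `kubectl exec`) is left open. It assumes discarding the old log contents is acceptable.

```shell
#!/bin/sh
# Hypothetical helper (not part of dcgm-exporter): empty a log file
# once it grows past a byte limit.
# Usage: cap_log FILE MAX_BYTES
cap_log() {
    file=$1
    max_bytes=$2
    # Nothing to do if the file does not exist.
    [ -f "$file" ] || return 0
    # wc -c reports the size in bytes; tr strips padding some wc builds add.
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$max_bytes" ]; then
        # Truncate in place rather than deleting, so the path keeps
        # pointing at the same file the logging process has open.
        : > "$file"
    fi
}

# Example call (limit is an assumption, roughly 100 MiB):
# cap_log /var/log/nv-hostengine.log 104857600
```

Truncating in place rather than removing the file is a deliberate choice here: deleting a log that a process still holds open does not free the space until that process closes it.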