NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Let dcgm-exporter be a daemon #367

Open zvonkok opened 4 months ago

zvonkok commented 4 months ago

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

Please provide a clear description of the problem this feature solves

In constrained environments I do not have a shell and cannot easily send a process to the background. dcgm-exporter should let me choose whether it runs in the foreground or in the background.

Feature Description

Let dcgm-exporter daemonize like nv-hostengine or nvidia-persistenced. dcgm-exporter should get a new argument so that it can automatically be sent to the background.

Describe your ideal solution

Either send dcgm-exporter to the background by default, or add a new argument, e.g. -d, to daemonize the process.
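To make the request concrete, here is a minimal sketch of how a -d flag could work in a Go program. It is not dcgm-exporter code and not a proposed patch. A Go process cannot safely fork() itself because the runtime is multi-threaded, so the usual approach is to re-execute the same binary in a new session and let the parent exit. The environment variable EXPORTER_DAEMONIZED and the helper names are hypothetical, and the sketch assumes Linux.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// daemonEnv marks the re-executed child so it does not detach a second time.
// The name is hypothetical and not part of dcgm-exporter.
const daemonEnv = "EXPORTER_DAEMONIZED"

// wantsDaemon reports whether the hypothetical -d flag is present.
func wantsDaemon(args []string) bool {
	for _, a := range args {
		if a == "-d" {
			return true
		}
	}
	return false
}

// detachedCommand builds a command that runs path with args in a new session.
// Stdin, Stdout and Stderr are left nil, which os/exec maps to the null device.
func detachedCommand(path string, args, env []string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.Env = append(append([]string{}, env...), daemonEnv+"=1")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	return cmd
}

func main() {
	if wantsDaemon(os.Args[1:]) && os.Getenv(daemonEnv) == "" {
		self, err := os.Executable()
		if err != nil {
			fmt.Fprintln(os.Stderr, "cannot locate executable:", err)
			os.Exit(1)
		}
		cmd := detachedCommand(self, os.Args[1:], os.Environ())
		if err := cmd.Start(); err != nil {
			fmt.Fprintln(os.Stderr, "cannot start background process:", err)
			os.Exit(1)
		}
		// Report the child's PID, stop tracking it, and let the parent exit.
		fmt.Println(cmd.Process.Pid)
		_ = cmd.Process.Release()
		return
	}
	// Foreground mode, or the detached child: the real work would start here.
}
```

With this shape, running the binary without -d keeps today's foreground behaviour, while -d returns control to the caller immediately and needs neither a shell nor a process manager.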

Additional context

No response

nvvfedorov commented 3 months ago

@zvonkok, you can use a process manager, such as init.d, systemd, or Supervisor (http://supervisord.org/), to run the DCGM-exporter as a daemon in the background. Will this work for your use case?

zvonkok commented 3 months ago

Unfortunately not. I am not running any process manager and would like to avoid such a "huge" dependency for our microVM use case. Additionally, we're running in confidential computing environments, where we want to minimize the attack surface by reducing the size of the guest rootfs. Another question: can we run dcgm-exporter as a specific user/group, or does it need root rights per se?

nvvfedorov commented 3 months ago

@zvonkok, thank you for providing a use case that justifies the feature request.

Re: Can we run dcgm-exporter with a specific user/group? Does it need per-se root rights?

The DCGM-exporter can be run without root permissions when it uses an embedded DCGM engine. However, DCP metrics aren't available without root access.

That said, the DCGM-exporter can also use a "remote" nv-hostengine that runs with root privileges. The remote nv-hostengine provides an endpoint from which the DCGM-exporter can read GPU metrics while itself running without root privileges.

Here's an example of the command line for running the DCGM-exporter with a connection to the remote nv-hostengine:

    dcgm-exporter -f ./etc/default-counters.csv -r localhost:5555
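For an environment without a shell or process manager, the same invocation could be started by a small launcher that drops to an unprivileged user. The following is only a sketch under stated assumptions: the uid/gid 65534 is a hypothetical choice, the helper name is made up, and the launcher itself would need permission to change uid/gid (typically root) for the launch to succeed. Only the -f and -r arguments are taken from the command above. To stay safe to run, main only prints the planned command.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// exporterCommand builds the dcgm-exporter invocation quoted in the thread,
// configured to run as the given uid/gid instead of root.
func exporterCommand(uid, gid uint32, counters, hostengine string) *exec.Cmd {
	cmd := exec.Command("dcgm-exporter", "-f", counters, "-r", hostengine)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Applied by the kernel when the child process is started.
		Credential: &syscall.Credential{Uid: uid, Gid: gid},
	}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	// 65534 is a hypothetical unprivileged uid/gid chosen for illustration.
	cmd := exporterCommand(65534, 65534, "./etc/default-counters.csv", "localhost:5555")
	cred := cmd.SysProcAttr.Credential
	fmt.Printf("would run %q as uid=%d gid=%d\n", cmd.String(), cred.Uid, cred.Gid)
	// Calling cmd.Run() here would launch the exporter; it fails unless the
	// launcher is allowed to switch to the requested uid/gid.
}
```

The nv-hostengine listening on localhost:5555 still has to be started separately with root privileges, as described above.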

Irene-123 commented 3 months ago

Hi @nvvfedorov, I went through the issue and would like to contribute here, though I need some time for more clarification and understanding. Can I take this up as my first issue here? If you have any suggestions, let me know :) Thanks!

zvonkok commented 2 months ago

Back from a bigger break. Any follow-up on this?