NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0
923 stars 159 forks source link

[Helm] Enable ConfigMap mount by default #350

Closed chipzoller closed 3 months ago

chipzoller commented 4 months ago

Signed-off-by: chipzoller chipzoller@gmail.com

Superseded by #351

Closes #170

Problem

The current version (3.4.2) Helm chart has a couple of problems.

  1. It unnecessarily creates a ConfigMap and corresponding RBAC resources (ServiceAccount, Role, and RoleBinding) which are unused.
  2. It makes it such that users are not able to modify the metrics emitted by default as the container embeds the CSV used to declare those metrics. Because of problem number one, this gives users the illusion that modifying the ConfigMap will result in DCGM-exporter using that as its operating config and it will not.

Users need an extensible ability to modify the metrics exposed by DCGM exporter after it has been deployed. For example, users may need additional software components deployed in the cluster which require non-default metrics sourced from DCGM-exporter. With no way to instruct DCGM-exporter to expose additional metrics, this will not work.

Solution

Two easiest possible solutions to the above problems.

  1. Mount the ConfigMap by default so the operating config is taken from it rather than internally. That is what this PR does.
  2. Remove the default creation of the ConfigMap and supporting RBAC resources, and create new Helm values which allow users to opt-in to them.

This PR is intended as a conversation starter with the maintainers to see which approach, or possibly another, they would prefer.

After these changes, the net effect for users will be zero as the CSV used internally and that which is provided in the Helm chart are identical aside from some descriptors. This should be a non-disruptive upgrade for users.

If maintainers are open to the idea, my suggestion to make this more flexible for users is to allow users to specify contents of the ConfigMap in-line to the values file so they can include custom metrics during deployment.

chipzoller commented 4 months ago

cc @nvvfedorov and @glowkey for your suggestions/input.

nvvfedorov commented 4 months ago

@chipzoller , Thank you for your contribution. We will review, test and let you know results.

chipzoller commented 4 months ago

Relates to #351

chipzoller commented 4 months ago

Any feedback here?

glowkey commented 4 months ago

Sorry for the delay. We hope to provide feedback on which solution might be preferable within the next few weeks. We're also hoping to see others chime in with feedback.

chipzoller commented 4 months ago

Just following up here to see if there's been any discussion on these proposed changes. At the least I'd be curious to know why these changes (or alternatively the second solution) would not be desirable to have.

nvvfedorov commented 3 months ago

@chipzoller, We need to take some time to review, test the changes, and make a decision.

chipzoller commented 3 months ago

Closing in favor of #351 where changes from this PR have been incorporated.