NVIDIA / cloud-native-docs

Documentation repository for NVIDIA Cloud Native Technologies
https://docs.nvidia.com/datacenter/cloud-native/
Apache License 2.0

Are custom DCGM Exporter Metrics supported anymore? #22

Open weakcamel opened 8 months ago

weakcamel commented 8 months ago

1. Quick Debug Information

2. Issue or feature description

This is potentially a documentation issue. If the feature really is no longer supported, then its deprecation is also missing from the changelogs.

Version 23.3.2 of the GPU Operator supported customizing the DCGM Exporter config via a config map:

The docs for 23.5.0 and later, however, are missing this section entirely, e.g.:

Does this mean that it is no longer supported? Or is it still allowed and was just missed while reorganizing the docs?

3. Steps to reproduce the issue

See the documentation links above.

4. Information to attach (optional if deemed irrelevant)

n/a

weakcamel commented 8 months ago

Note: I found the old (now missing) documentation section in this issue: https://github.com/NVIDIA/gpu-operator/issues/648

mikemckiernan commented 8 months ago

@weakcamel, I'm the guilty party for the reorganizing of the docs. I believe that section from 23.3.2 is still supported--none of the engineers said it wasn't.

I'll work to confirm that it still applies. When I restore the content, I'll locate it somewhere in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html. I apologize for the confusion. Your report suggests that you might be sympathetic to the idea that customizing the metrics might not be a "getting started" task.

I'm OK to work with this issue from here, but won't object if someone scoots by and moves it to github.com/nvidia/cloud-native-docs.

weakcamel commented 7 months ago

@mikemckiernan No worries at all!

It's really good news that this wasn't a deprecated feature, just a side effect of housekeeping :)

As for the location, I agree it's not necessarily "Getting Started", but I would think it should probably be a part of gpu-operator (not gpu-telemetry)? I personally would never have made the connection to look under gpu-telemetry. Also, setting up the exporter as part of gpu-operator is quite significantly different from running it on its own.

Maybe a subsection under Advanced Operator Configuration? There doesn't seem to be any part related to metrics in current docs at all.


cdesiniotis commented 7 months ago

@weakcamel since this is a documentation issue, would you mind moving this issue to github.com/nvidia/cloud-native-docs?

elezar commented 7 months ago

I have transferred the issue.

weakcamel commented 7 months ago

Thanks for moving the ticket - yes, it's perfectly fine.

On a related note, one thing got me thinking: the old documentation snippet explains which options to override (e.g. dcgmExporter.config.name), yet those options aren't actually documented in either the Helm chart docs or the values file itself. Shouldn't they be?
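
For illustration only, here is a rough sketch of what such an override might look like as a Helm values fragment. Only dcgmExporter.config.name comes from the snippet discussed above. The config map name metrics-config, the DCGM_EXPORTER_COLLECTORS environment variable entry, and the /etc/dcgm-exporter/dcgm-metrics.csv path are assumptions based on recollection of the removed 23.3.2 section, so please verify them against the chart before relying on this.

```yaml
# Hypothetical values override for the GPU Operator Helm chart.
# Assumes a config map named "metrics-config" already exists in the
# operator namespace and contains a file called dcgm-metrics.csv.
dcgmExporter:
  config:
    # Option named in the old docs: the config map with the custom metrics list.
    name: metrics-config
  env:
    # Assumption: points the exporter at the CSV from the mounted config map.
    - name: DCGM_EXPORTER_COLLECTORS
      value: /etc/dcgm-exporter/dcgm-metrics.csv
```

If this is roughly right, it would be exactly the kind of thing worth documenting next to the other dcgmExporter values.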