weakcamel opened 8 months ago
Note: I found that old (missing now) documentation section in this issue: https://github.com/NVIDIA/gpu-operator/issues/648
@weakcamel, I'm the guilty party for reorganizing the docs. I believe that section from 23.3.2 is still supported; none of the engineers said it wasn't.
I'll work to confirm that it still applies. When I restore the content, I'll locate it somewhere in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/index.html. I apologize for the confusion. Your report suggests that you might be sympathetic to the idea that customizing the metrics might not be a "getting started" task.
I'm OK to work with this issue from here, but won't object if someone scoots by and moves it to github.com/nvidia/cloud-native-docs.
@mikemckiernan No worries at all!
It's really good news that this wasn't a deprecated feature and just a side effect of housekeeping :)
As for the location, I agree it's not necessarily "Getting Started" but I would think it should probably be a part of gpu-operator (not gpu-telemetry)? I personally would never have made the connection to look for gpu-telemetry. Also, setup of the exporter as part of gpu-operator is quite significantly different from running it on its own.
Maybe a subsection under Advanced Operator Configuration? There doesn't seem to be any part related to metrics in current docs at all.
@weakcamel since this is a documentation issue, would you mind moving this issue to github.com/nvidia/cloud-native-docs?
I have transferred the issue.
Thanks for moving the ticket - yes, it's perfectly fine.
On a related note, one thing got me thinking: the old documentation snippet explains which options to override (e.g. dcgmExporter.config.name), yet those options aren't actually documented in either the Helm chart docs or the values file itself. Shouldn't they be?
1. Quick Debug Information
2. Issue or feature description
This is potentially a documentation issue. If the feature is in fact no longer supported, then its deprecation is also missing from the changelogs.
Version 23.3.2 of the GPU Operator supported customization of the DCGM Exporter config via a config map:
The 23.5.0 and following docs however are missing this section entirely, e.g.:
Does this mean that it is no longer supported? Or is it still allowed and was just missed while reorganizing the docs?
3. Steps to reproduce the issue
See the documentation links above.
4. Information to attach (optional if deemed irrelevant)
n/a