ClusterCockpit / cc-metric-collector

A node agent for measuring, processing and forwarding node level metrics
MIT License

Little fixes to the prometheus sink #115

Open oscarminus opened 7 months ago

oscarminus commented 7 months ago
  1. NVLink error counters are unsigned integers, but the cast method did not handle unsigned types.
  2. Prometheus was not an available sink, although the code for it was already there.
  3. NVLink errors are gathered per GPU and per link. The Prometheus exporter did not recognize the link ID and therefore only reported the last link of each GPU. For our use case, the sum over all links per GPU is the interesting value, so we implement it as a separate metric.
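
To illustrate points 1 and 3, here is a minimal standalone sketch, not the code from this PR. The `sample` type, the `toFloat64` and `sumPerGPU` helpers, and the `nvlink_errors_sum` name are all hypothetical and exist only for this example. It assumes each sample carries a GPU ID, a link ID, and a raw value that may be an unsigned integer.

```go
package main

import (
	"fmt"
	"sort"
)

// sample is a hypothetical, simplified stand-in for one collected metric.
type sample struct {
	GPU   string      // assumed GPU identifier
	Link  string      // assumed NVLink identifier
	Value interface{} // raw value as delivered by the collector
}

// toFloat64 converts common numeric types to float64.
// The unsigned cases are the kind of branch that point 1 says was missing.
func toFloat64(v interface{}) (float64, bool) {
	switch n := v.(type) {
	case float64:
		return n, true
	case float32:
		return float64(n), true
	case int:
		return float64(n), true
	case int32:
		return float64(n), true
	case int64:
		return float64(n), true
	case uint:
		return float64(n), true
	case uint32:
		return float64(n), true
	case uint64:
		return float64(n), true
	default:
		return 0, false
	}
}

// sumPerGPU adds up the values of all links that belong to the same GPU.
// Samples whose value cannot be converted are skipped.
func sumPerGPU(samples []sample) map[string]float64 {
	sums := make(map[string]float64)
	for _, s := range samples {
		if f, ok := toFloat64(s.Value); ok {
			sums[s.GPU] += f
		}
	}
	return sums
}

func main() {
	samples := []sample{
		{GPU: "0", Link: "0", Value: uint64(2)},
		{GPU: "0", Link: "1", Value: uint64(5)},
		{GPU: "1", Link: "0", Value: uint64(1)},
	}
	sums := sumPerGPU(samples)

	// Sort the GPU IDs so the output order is deterministic.
	gpus := make([]string, 0, len(sums))
	for g := range sums {
		gpus = append(gpus, g)
	}
	sort.Strings(gpus)
	for _, g := range gpus {
		fmt.Printf("gpu=%s nvlink_errors_sum=%g\n", g, sums[g])
	}
}
```

Keying the sum by GPU only means no per-link value can overwrite another, which is the effect described in point 3 when the link ID is not part of the exported series.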