galexrt / dellhw_exporter

Prometheus exporter for Dell Hardware components using Dell OMSA.
https://dellhw-exporter.galexrt.moe
Apache License 2.0
111 stars · 40 forks

Multiple controllers cause duplicate metrics logs #39

Closed · HP41 closed this issue 4 years ago

HP41 commented 4 years ago

Been receiving these duplicate-metric error messages:

dellhw_exporter[3781]: time="2020-05-14T12:28:12-04:00" level=info msg="error gathering metrics: collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values"

dellhw_exporter[96877]: time="2020-05-14T12:43:35-04:00" level=info msg="error gathering metrics: 9 error(s) occurred:\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"1\" > gauge:<value:0 > } was collected before with the same name and label values"

dellhw_exporter[96877]: time="2020-05-14T12:43:05-04:00" level=info msg="error gathering metrics: 9 error(s) occurred:\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_enclosure_status\" { label:<name:\"enclosure\" value:\"0_1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"1\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"1\" > gauge:<value:0 > } was collected before with the same name and label values"

dellhw_exporter[13642]: time="2020-05-14T12:47:02-04:00" level=info msg="error gathering metrics: 2 error(s) occurred:\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values"

These are with OMSA 9.1.0 and 9.4.0.

Had a look through and it seems to come down to how omreport reports the information. For example:

The device name is unique; however, the scraped ID fields need not be unique across controllers.

omreport storage battery -fmt ssv
List of Batteries in the System

Controller PERC H840 Adapter  (Slot 1)

ID;Status;Name;State;Recharge Count;Max Recharge Count;Learn State;Next Learn Time;Maximum Learn Delay
0;Ok;Battery ;Ready;Not Applicable;Not Applicable;Not Applicable;Not Applicable;Not Applicable

Controller PERC H740P Mini  (Slot Embedded)

ID;Status;Name;State;Recharge Count;Max Recharge Count;Learn State;Next Learn Time;Maximum Learn Delay
0;Ok;Battery ;Ready;Not Applicable;Not Applicable;Not Applicable;Not Applicable;Not Applicable

The metrics are still collected and exposed without crashing the exporter; however, only the metric for the first item is displayed.
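The collision above can be sketched in Go (the exporter's language). omreport restarts its ID numbering inside every "Controller ..." section, so a key built from the ID alone repeats, while the (controller, ID) pair is unique. This is a stdlib-only illustration, not the exporter's actual parser; `batteryKeys` and `sample` are hypothetical names:

```go
// Sketch: why the battery IDs collide, and how tracking the current
// "Controller ..." heading while parsing the ssv output yields a
// unique (controller, id) key per row.
package main

import (
	"fmt"
	"strings"
)

// sample is trimmed from the omreport storage battery output above.
const sample = `List of Batteries in the System

Controller PERC H840 Adapter  (Slot 1)

ID;Status;Name;State
0;Ok;Battery ;Ready

Controller PERC H740P Mini  (Slot Embedded)

ID;Status;Name;State
0;Ok;Battery ;Ready`

// batteryKeys returns one "controller|id" key per data row.
func batteryKeys(out string) []string {
	var keys []string
	controller := ""
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		switch {
		case strings.HasPrefix(line, "Controller "):
			// Remember the section heading; collapse doubled spaces.
			name := strings.TrimPrefix(line, "Controller ")
			controller = strings.Join(strings.Fields(name), " ")
		case line == "" || strings.HasPrefix(line, "ID;") || strings.HasPrefix(line, "List of"):
			// Blank lines and headers carry no battery data.
		default:
			// Data row: the first ssv field is the (per-controller) ID.
			id := strings.SplitN(line, ";", 2)[0]
			keys = append(keys, controller+"|"+id)
		}
	}
	return keys
}

func main() {
	// Both rows have ID 0; only the controller name disambiguates them.
	for _, k := range batteryKeys(sample) {
		fmt.Println(k)
	}
}
```

Keying on the ID alone would produce `0` twice, which is exactly the "was collected before with the same name and label values" error in the logs.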

galexrt commented 4 years ago

Thanks for reporting this issue!


This problem occurs because the ID field from the output is used as the metric key, but that ID only identifies the battery (or whatever other "object" was requested) within its controller section.

One way to solve the issue would be to simply add a new label, e.g., controller_name, holding the name of the controller (controller_name="Controller PERC H740P Mini (Slot Embedded)") for all relevant cases.

What do you think about this solution?

HP41 commented 4 years ago

Yeah, I was thinking the same, actually. However, do you think that name is label-friendly?

As in, how standardized is Dell's controller naming across controller and OMSA versions? This could perhaps affect cardinality.

galexrt commented 4 years ago

@HP41 I think using Dell's controller names should be fine for a label, as the ID right now is completely useless due to omreport. Another way would be to add logic to "correct" the ID number in dellhw_exporter automatically, but adding a label with the controller name seems more convenient in general (e.g., it allows the controller name to be included in Prometheus alerts, so there is no need to run omreport manually in the container or guess by number).
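To see why the extra label fixes the duplicates: Prometheus identifies a time series by its metric name plus its full label set, so two samples with identical names and labels are rejected as duplicates, while any differing label value makes them distinct series. A stdlib-only sketch of that identity rule (`seriesID` is a hypothetical helper, not part of the exporter or client library):

```go
// Sketch: a metric name plus its (sorted) label set is the identity of
// a Prometheus time series. Adding controller_name makes otherwise
// identical battery series distinct.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesID renders name{k1="v1",k2="v2",...} with labels in sorted order,
// mirroring how Prometheus distinguishes series.
func seriesID(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

func main() {
	// With only the per-controller ID, the two batteries collide:
	a := seriesID("dell_hw_storage_battery_status", map[string]string{"controller": "0"})
	b := seriesID("dell_hw_storage_battery_status", map[string]string{"controller": "0"})
	fmt.Println(a == b) // identical series: the duplicate-metric error

	// Adding controller_name makes the label sets distinct:
	c := seriesID("dell_hw_storage_battery_status",
		map[string]string{"controller": "0", "controller_name": "PERC H840 Adapter (Slot 1)"})
	d := seriesID("dell_hw_storage_battery_status",
		map[string]string{"controller": "0", "controller_name": "PERC H740P Mini (Slot Embedded)"})
	fmt.Println(c == d)
}
```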

I'll go ahead and look into working on that this weekend / next week.

HP41 commented 4 years ago

Yeah, you're right. There's no perfect way to fix this as the issue stems from omreport itself and the fix you mention is the cleanest and easiest!

Thank you again!

galexrt commented 4 years ago

@HP41 I have pushed a branch fix_39 which should add the controller_name label to all metrics (where it is / was available).

Please try it out, it should fix the duplicate metrics errors in the logs.

The container image tag fix_39 should be available in a few minutes on Docker Hub and Quay.io.

If you need a binary instead of the container image of the fix_39 branch to test, please let me know.


Could you please send the output of the following omreport commands? I would like to make sure the tests are correct for those metrics:

omreport -fmt ssv storage controller
omreport -fmt ssv chassis volts

Thanks in advance for testing and sending me the outputs!

HP41 commented 4 years ago

The bare-metal systems do not have Docker, so a binary would be helpful!

For the command outputs:

omreport storage controller -fmt ssv
List of Controllers in the system

Controller

ID;Status;Name;Slot ID;State;Firmware Version;Minimum Required Firmware Version;Driver Version;Minimum Required Driver Version;Storport Driver Version;Minimum Required Storport Driver Version;Number of Connectors;Rebuild Rate;BGI Rate;Check Consistency Rate;Reconstruct Rate;Alarm State;Cluster Mode;SCSI Initiator ID;Cache Memory Size;Patrol Read Mode;Patrol Read State;Patrol Read Rate;Patrol Read Iterations;Abort Check Consistency on Error;Allow Revertible Hot Spare and Replace Member;Load Balance;Auto Replace Member on Predictive Failure;Redundant Path view;CacheCade Capable;Persistent Hot Spare;Encryption Capable;Encryption Key Present;Encryption Mode;Preserved Cache;Spin Down Unconfigured Drives;Spin Down Hot Spares;Spin Down Configured Drives;Automatic Disk Power Saving (Idle C);Start Time (HH:MM);Time Interval for Spin Up (in Hours);T10 Protection Information Capable;Non-RAID HDD Disk Cache Policy;Current Controller Mode
0;Ok;PERC H730P Mini;Embedded;Ready;25.5.5.0005;Not Applicable;06.810.09.00-rc1;Not Applicable;Not Applicable;Not Applicable;1;30%;30%;30%;30%;Not Applicable;Not Applicable;Not Applicable;2048 MB;Auto;Stopped;30%;159;Disabled;Enabled;Not Applicable;Disabled;Not Applicable;Not Applicable;Disabled;Yes;No;None;Not Applicable;Disabled;Disabled;Disabled;Disabled;Not Applicable;Not Applicable;No;Unchanged;RAID

Controller

ID;Status;Name;Slot ID;State;Firmware Version;Minimum Required Firmware Version;Driver Version;Minimum Required Driver Version;Storport Driver Version;Minimum Required Storport Driver Version;Number of Connectors;Rebuild Rate;BGI Rate;Check Consistency Rate;Reconstruct Rate;Alarm State;Cluster Mode;SCSI Initiator ID;Cache Memory Size;Patrol Read Mode;Patrol Read State;Patrol Read Rate;Patrol Read Iterations;Abort Check Consistency on Error;Allow Revertible Hot Spare and Replace Member;Load Balance;Auto Replace Member on Predictive Failure;Redundant Path view;CacheCade Capable;Persistent Hot Spare;Encryption Capable;Encryption Key Present;Encryption Mode;Preserved Cache;Spin Down Unconfigured Drives;Spin Down Hot Spares;Spin Down Configured Drives;Automatic Disk Power Saving (Idle C);Start Time (HH:MM);Time Interval for Spin Up (in Hours);T10 Protection Information Capable;Non-RAID HDD Disk Cache Policy
1;Ok;PERC H810 Adapter;PCI Slot 6;Ready;21.3.5-0002;Not Applicable;06.810.09.00-rc1;Not Applicable;Not Applicable;Not Applicable;2;30%;30%;30%;30%;Not Applicable;Not Applicable;Not Applicable;1024 MB;Auto;Stopped;30%;122;Disabled;Enabled;Auto;Disabled;Detected;Yes;Disabled;Yes;No;None;Not Applicable;Disabled;Disabled;Disabled;Disabled;Not Applicable;Not Applicable;No;Not Applicable

Controller

ID;Status;Name;Slot ID;State;Firmware Version;Minimum Required Firmware Version;Driver Version;Minimum Required Driver Version;Storport Driver Version;Minimum Required Storport Driver Version;Number of Connectors;Rebuild Rate;BGI Rate;Check Consistency Rate;Reconstruct Rate;Alarm State;Cluster Mode;SCSI Initiator ID;Cache Memory Size;Patrol Read Mode;Patrol Read State;Patrol Read Rate;Patrol Read Iterations;Abort Check Consistency on Error;Allow Revertible Hot Spare and Replace Member;Load Balance;Auto Replace Member on Predictive Failure;Redundant Path view;CacheCade Capable;Persistent Hot Spare;Encryption Capable;Encryption Key Present;Encryption Mode;Preserved Cache;Spin Down Unconfigured Drives;Spin Down Hot Spares;Spin Down Configured Drives;Automatic Disk Power Saving (Idle C);Start Time (HH:MM);Time Interval for Spin Up (in Hours);T10 Protection Information Capable;Non-RAID HDD Disk Cache Policy
2;Ok;PERC H810 Adapter;PCI Slot 4;Ready;21.3.5-0002;Not Applicable;06.810.09.00-rc1;Not Applicable;Not Applicable;Not Applicable;2;30%;30%;30%;30%;Not Applicable;Not Applicable;Not Applicable;1024 MB;Auto;Stopped;30%;96;Disabled;Enabled;Auto;Disabled;Detected;Yes;Disabled;Yes;No;None;Not Applicable;Disabled;Disabled;Disabled;Disabled;Not Applicable;Not Applicable;No;Not Applicable
omreport chassis volts -fmt ssv
Voltage Probes Information

Health : Ok

Index;Status;Probe Name;Reading;Minimum Warning Threshold;Maximum Warning Threshold;Minimum Failure Threshold;Maximum Failure Threshold
0;Ok;CPU1 VCORE PG;Good;[N/A];[N/A];[N/A];[N/A]
1;Ok;CPU2 VCORE PG;Good;[N/A];[N/A];[N/A];[N/A]
2;Ok;System Board 3.3V PG;Good;[N/A];[N/A];[N/A];[N/A]
3;Ok;System Board 5V AUX PG;Good;[N/A];[N/A];[N/A];[N/A]
4;Ok;CPU2 M23 VPP PG;Good;[N/A];[N/A];[N/A];[N/A]
5;Ok;CPU1 M23 VPP PG;Good;[N/A];[N/A];[N/A];[N/A]
6;Ok;System Board 1.05V PG;Good;[N/A];[N/A];[N/A];[N/A]
7;Ok;System Board BP0 5V PG;Good;[N/A];[N/A];[N/A];[N/A]
8;Ok;CPU1 M23 VDDQ PG;Good;[N/A];[N/A];[N/A];[N/A]
9;Ok;CPU1 M23 VTT PG;Good;[N/A];[N/A];[N/A];[N/A]
10;Ok;System Board 5V SWITCH PG;Good;[N/A];[N/A];[N/A];[N/A]
11;Ok;System Board DIMM PG;Good;[N/A];[N/A];[N/A];[N/A]
12;Ok;System Board VCCIO PG;Good;[N/A];[N/A];[N/A];[N/A]
13;Ok;CPU2 M01 VDDQ PG;Good;[N/A];[N/A];[N/A];[N/A]
14;Ok;CPU1 M01 VDDQ PG;Good;[N/A];[N/A];[N/A];[N/A]
15;Ok;CPU2 M23 VTT PG;Good;[N/A];[N/A];[N/A];[N/A]
16;Ok;CPU2 M01 VTT PG;Good;[N/A];[N/A];[N/A];[N/A]
17;Ok;System Board NDC PG;Good;[N/A];[N/A];[N/A];[N/A]
18;Ok;CPU2 M01 VPP PG;Good;[N/A];[N/A];[N/A];[N/A]
19;Ok;CPU1 M01 VPP PG;Good;[N/A];[N/A];[N/A];[N/A]
20;Ok;CPU2 M23 VDDQ PG;Good;[N/A];[N/A];[N/A];[N/A]
21;Ok;System Board 1.5V PG;Good;[N/A];[N/A];[N/A];[N/A]
22;Ok;OEM fru PS2 PG Fail;Good;[N/A];[N/A];[N/A];[N/A]
23;Ok;System Board PS1 PG Fail;Good;[N/A];[N/A];[N/A];[N/A]
24;Ok;System Board BP1 5V PG;Good;[N/A];[N/A];[N/A];[N/A]
25;Ok;System Board 1.5V AUX PG;Good;[N/A];[N/A];[N/A];[N/A]
26;Ok;CPU1 M01 VTT PG;Good;[N/A];[N/A];[N/A];[N/A]
27;Ok;PS1 Voltage 1;208 V;[N/A];[N/A];[N/A];[N/A]
28;Ok;PS2 Voltage 2;210 V;[N/A];[N/A];[N/A];[N/A]
29;Ok;CPU1 FIVR PG;Good;[N/A];[N/A];[N/A];[N/A]
30;Ok;CPU2 FIVR PG;Good;[N/A];[N/A];[N/A];[N/A]
31;Ok;System Board 2.5V AUX PG;Good;[N/A];[N/A];[N/A];[N/A]

Thank you again!!!

galexrt commented 4 years ago

Thanks for the command outputs!

Interestingly, the IDs seem to be correct, at least for the omreport storage controller command. Still, please try out the build, thanks!

I have attached a build from my dev machine here as a zip (can't upload it directly due to GitHub restrictions).

dellhw_exporter.zip (built for 64-bit Linux)

Should it not work (e.g., ldd reporting a missing glibc or similar), let me know and I'll rebuild it.

HP41 commented 4 years ago

Sorry about the late reply but I gave it a shot:

New exporter:

dellhw_exporter, version 1.4.3 (branch: master, revision: d3152bc1a4277363b3128d83eb0f4ba95e6172aa)
  build user:       atrost@debwrk01
  build date:       20200323-22:56:35
  go version:       go1.14

The issue is still happening with the new version:

Jun 02 18:51:02 dellhw_exporter[44422]: time="2020-06-02T18:51:02-04:00" level=info msg="error gathering metrics: 2 error(s) occurred:\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values"

Old Exporter:

dellhw_exporter, version 1.4.3 (branch: HEAD, revision: e0cd1219c6aa61c3958aa0a53aa6f02e78013756)
  build user:       root@9f19e05ae32b
  build date:       20200323-22:23:49
  go version:       go1.13.9

Error:

Jun 02 18:55:02 dellhw_exporter[1927]: time="2020-06-02T18:55:02-04:00" level=info msg="error gathering metrics: 2 error(s) occurred:\n* collected metric \"dell_hw_storage_vdisk_status\" { label:<name:\"vdisk\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values\n* collected metric \"dell_hw_storage_battery_status\" { label:<name:\"controller\" value:\"0\" > gauge:<value:0 > } was collected before with the same name and label values"
galexrt commented 4 years ago

@HP41 Oops, seems that I built from the wrong branch...

I have now rebuilt it from the correct branch. (Please note that in the next minor (possibly major) release there will be changes to the flags; see README.md.)

Here is the correct version of the binary: dellhw_exporter.zip

$ dellhw_exporter --version
dellhw_exporter, version 1.4.3 (branch: fix_39, revision: 2ece399f783ad721a5c08e49a2cd43831b0dca5e)
  build user:       atrost@debwrk01
  build date:       20200604-20:26:45
  go version:       go1.14.3
HP41 commented 4 years ago

Perfect, it worked well!

I was wondering if you're using the Prometheus logging/version libraries like node_exporter does?

dellhw_exporter, version 1.4.3 (branch: fix_39, revision: 2ece399f783ad721a5c08e49a2cd43831b0dca5e)
  build user:       atrost@debwrk01
  build date:       20200604-20:26:45
  go version:       go1.14.3
dell_hw_storage_battery_status{controller="0",controller_name="PERC H710 Mini (Slot Embedded)"} 0
dell_hw_storage_battery_status{controller="0",controller_name="PERC H810 Adapter (Slot 5)"} 0
dell_hw_storage_enclosure_status{controller_name="PERC H710 Mini (Embedded)",enclosure="0_1"} 0
dell_hw_storage_enclosure_status{controller_name="PERC H810 Adapter (Slot 5)",enclosure="0_0"} 0
dell_hw_storage_vdisk_status{controller_name="PERC H710 Mini (Embedded)",vdisk="0"} 0
dell_hw_storage_vdisk_status{controller_name="PERC H710 Mini (Embedded)",vdisk="1"} 0
dell_hw_storage_vdisk_status{controller_name="PERC H810 Adapter (Slot 5)",vdisk="0"} 0
galexrt commented 4 years ago

@HP41 Awesome, thanks for confirming! I'm going to merge my PR(s) and create a new release soon.


If I remember correctly, I basically "copy'n'pasted" the node_exporter code and worked from there, to keep the same structure / logic model.

The library for the version output is here: https://github.com/prometheus/common/blob/master/version/info.go. I use it in several of my other projects as well.

galexrt commented 4 years ago

@HP41 I switched from CircleCI to GitHub Actions yesterday; the release pipeline is now working again, and the latest release v1.5.16 includes the PR that fixed this issue, see https://github.com/galexrt/dellhw_exporter/releases/tag/v1.5.16 (binaries are included again).

HP41 commented 4 years ago

Thank you again!