galexrt / dellhw_exporter

Prometheus exporter for Dell Hardware components using Dell OMSA.
https://dellhw-exporter.galexrt.moe
Apache License 2.0

no components found #99

Closed AyoubH closed 8 months ago

AyoubH commented 8 months ago

Hello @galexrt, I installed v1.13.8 on some servers and just found out that the pdisk error is still present on one of them. This time, however, running the omreport command shows no fans, ps, or pdisk components.

DEBUG mode
ERRO[0024] ps collector failed after 0.089713s: failed to execute command. exit status 255
ERRO[0024] fans collector failed after 0.093299s: failed to execute command. exit status 255
ERRO[0024] storage_pdisk collector failed after 0.470633s: failed to execute command. exit status 255

the omreport shows that there is no disk, although it has two, which I can confirm through iDRAC UI

`List of Physical Disks on Controller PCIe SSD Subsystem (Not Available)

Controller PCIe SSD Subsystem (Not Available)
No Physical Disks found`

Fans and PS collectors
This one occurs on only a few servers, but I can open a separate discussion for it.
The omreport command also reports that there are no fans or power supplies:
Error! No fan probes found on this system.
Error! No instrumented power supplies found on this system.
galexrt commented 8 months ago

Regarding the failed to execute command errors, I have released v1.13.9 which will log the command that failed, so we can start debugging the issues.

the omreport shows that there is no disk, although it has two, which I can confirm through iDRAC UI

If omreport isn't reporting it, the exporter can't pick it up. Unless the exporter is running a command with the wrong controller ID(s) there isn't much that I can do as far as I'm aware.

AyoubH commented 8 months ago

The exporter is running with the right controller ID, which is 0. I think you're right: if omreport is not reporting any components (pdisk, ps, and fans), that is something I need to report to Dell. I will keep you updated if anything comes up that can help the dellhw_exporter project.

Here are the logs:
ERRO[0035] fans collector failed after 0.077419s: failed to execute command ("/opt/dell/srvadmin/bin/omreport [chassis fans -fmt ssv]"). exit status 255
ERRO[0035] ps collector failed after 0.092358s: failed to execute command ("/opt/dell/srvadmin/bin/omreport [chassis pwrsupplies -fmt ssv]"). exit status 255
ERRO[0035] storage_pdisk collector failed after 0.280321s: failed to execute command ("/opt/dell/srvadmin/bin/omreport [storage pdisk controller=0 -fmt ssv]"). exit status 255

AyoubH commented 8 months ago

@galexrt can we ignore these errors when the component is not found?

galexrt commented 8 months ago

@AyoubH Please run the following commands and post the output. The commands should be run in the same environment the exporter is running in (if you use the container image, use docker exec/crictl exec; if you run the binary on the host, run the commands directly on the host):

AyoubH commented 8 months ago

@galexrt I got feedback from Dell: the server in question is a blade server, meaning it sits in a Dell - M1000E-1 PowerEdge Blade chassis. This explains why no fans or power supplies are found.

They also said that since there is no RAID controller on the server, omreport won't show any disks even though the server has two. I'm still checking why and how to fix it.

I think it would be better if we could ignore any error related to a component not being found.

Below is the output:

  1. root@server1 myuser # omreport chassis fans -fmt ssv; echo "Exit Code: $?"
     Error! No fan probes found on this system.
     Exit Code: 255

  2. root@server1 myuser # omreport chassis pwrsupplies -fmt ssv; echo "Exit Code: $?"
     Error! No instrumented power supplies found on this system.
     Exit Code: 255

  3. root@server1 myuser # omreport storage pdisk controller=0 -fmt ssv; echo "Exit Code: $?"
     List of Physical Disks on Controller PCIe SSD Subsystem (Not Available)
     Controller PCIe SSD Subsystem (Not Available)
     ID;Status;Name;State;Power Status;Bus Protocol;Media;Part of Cache Pool;Remaining Rated Write Endurance;Failure Predicted;Revision;Driver Version;Model Number;T10 PI Capable;Certified;Encryption Capable;Encryption Protocol;Encrypted;Progress;Mirror Set ID;Capacity;Used RAID Disk Space;Available RAID Disk Space;Hot Spare;Vendor ID;Product ID;Serial No.;Part Number;Negotiated Speed;Capable Speed;PCIe Negotiated Link Width;PCIe Maximum Link Width;Sector Size;Device Write Cache;Manufacture Day;Manufacture Week;Manufacture Year;SAS Address;WWN;Non-RAID HDD Disk Cache Policy;Disk Cache Policy;Form Factor ;Sub Vendor;Available Spare;Cryptographic Erase Capable
     **No Physical Disks found**
     Exit Code: 255

Thank you
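
For readers following along, the `-fmt ssv` output above can be read as a header line followed by zero or more data rows. The sketch below is not the exporter's actual parser; `parseSSV` is a hypothetical helper, and the sample header is shortened to four columns for readability. It only illustrates that the pdisk output in this report contains a header but no rows.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSSV is a hypothetical helper (not taken from dellhw_exporter).
// It keeps only lines that contain a semicolon, treats the first such
// line as the header, and returns every later one as a data row.
func parseSSV(output string) (header []string, rows [][]string) {
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if !strings.Contains(line, ";") {
			// Title lines, blank lines, and messages such as
			// "No Physical Disks found" carry no fields.
			continue
		}
		fields := strings.Split(line, ";")
		if header == nil {
			header = fields
			continue
		}
		rows = append(rows, fields)
	}
	return header, rows
}

func main() {
	// Shortened stand-in for the pdisk output posted above.
	out := "List of Physical Disks on Controller PCIe SSD Subsystem (Not Available)\n\n" +
		"Controller PCIe SSD Subsystem (Not Available)\n" +
		"ID;Status;Name;State\n" +
		"No Physical Disks found\n"

	header, rows := parseSSV(out)
	fmt.Printf("columns=%d rows=%d\n", len(header), len(rows))
}
```

An empty `rows` slice with a non-empty header is one way to tell "command ran, nothing to report" apart from a malformed response.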

galexrt commented 8 months ago

@AyoubH I'll look at tweaking the logic for such "no components found" cases soon.

galexrt commented 8 months ago

@AyoubH I have released v1.13.10, which ignores exit code 255 now. Please try it and see if it resolves your exporter errors.
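
As a rough illustration of what "ignoring exit code 255" can look like in Go, here is a minimal sketch. It is not the code shipped in v1.13.10; `runTolerant` and `noComponentsExitCode` are hypothetical names, and `sh -c` stands in for omreport so the example runs on a machine without OMSA.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// noComponentsExitCode is the exit status omreport returned in this
// thread when a component type was absent. The name is hypothetical.
const noComponentsExitCode = 255

// runTolerant runs a command and reports skipped=true with a nil error
// when the command exits with status 255. Any other failure is returned
// unchanged so real errors stay visible.
func runTolerant(name string, args ...string) (output []byte, skipped bool, err error) {
	output, err = exec.Command(name, args...).CombinedOutput()

	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == noComponentsExitCode {
		return output, true, nil
	}
	return output, false, err
}

func main() {
	// Stand-in for "omreport chassis fans -fmt ssv" on a blade server.
	out, skipped, err := runTolerant("sh", "-c",
		`echo "Error! No fan probes found on this system."; exit 255`)
	fmt.Printf("skipped=%v err=%v output=%q\n", skipped, err, out)
}
```

One trade-off of this approach: if omreport uses 255 for failures other than "no components found", those would be silenced too, so keeping the captured output available for debug logging seems prudent.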