bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

Incorrect status when server is powered off (Dell poweredge R740) #110

Closed weeboo closed 1 year ago

weeboo commented 1 year ago

Hello, my server is a Dell PowerEdge R740. When the system is powered off, the reported status does not reflect reality:

.\CSH-SYS-x86_check_redfish.py -H x.x.x.x -f xxxxxxxxx.txt --info --detailed
CRITICAL: INFO: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 128GB) - BIOS: 2.17.1 - Serial: Cxxxxxxxxxx0 - ServiceTag: J8P6SK3 - Power: Off - Name: NOT SET

When the status of a sensor is unknown, why is it mapped to CRITICAL rather than UNKNOWN? And is it possible to exclude sensors with an unknown status when the server is powered off?

bb-Ricardo commented 1 year ago

Hi,

good point, I will have a look at it.

lgmu commented 1 year ago

Hi, I see a similar problem on an HPE ProLiant BL460c Gen10.

When the server is powered off, the fan and memory checks go critical:

[OK]: INFO: HPE ProLiant BL460c Gen10 (CPU: 2, MEM: 64GB) - BIOS: I41 v1.46 (10/02/2018) - Serial: *** - Power: Off - Name: NOT SET
[CRITICAL]: Chassi 1 : Fan '1' (0%) status is: UnavailableOffline
[CRITICAL]: Chassi enclosurechassis : Fan '1' (0%) status is: UnavailableOffline
[CRITICAL]: Memory module PROC 1 DIMM 1 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 2 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 3 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 4 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 5 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 6 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 7 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 1 DIMM 8 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 1 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 2 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 3 (16.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 4 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 5 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 6 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 7 (0.0GB) status is: None
[CRITICAL]: Memory module PROC 2 DIMM 8 (0.0GB) status is: None
[OK]: BMC: iLO 5 (Firmware: iLO 5 v2.72) and all nics are in 'OK' state.
[OK]: All network adapter (1) and ports (0) are in good condition
[OK]: Chassi 1 : No power supplies detected
[OK]: Chassi enclosurechassis : No power supplies detected
[OK]: All processors (2) are in good condition
[OK]: Chassi 1 : All temp sensors (0) are in good condition
[OK]: Chassi enclosurechassis : All temp sensors (0) are in good condition

bb-Ricardo commented 1 year ago

Hey, I just pushed a change to the next-release branch. Can you check it out and test if it works now?

thank you.

lgmu commented 1 year ago

Hi, thanks! I tried the latest changes on the next-release branch:

On the same Server it works great now:

[OK]: BMC: iLO 5 (Firmware: iLO 5 v2.72) and all nics are in 'OK' state.
[OK]: Chassi 1 : All fans (1) are in good condition
[OK]: Chassi enclosurechassis : All fans (1) are in good condition
[OK]: All 16 memory modules (Total 64.0GB) are in good condition
[OK]: All network adapter (1) and ports (0) are in good condition
[OK]: Chassi 1 : No power supplies detected
[OK]: Chassi enclosurechassis : No power supplies detected
[OK]: All processors (2) are in good condition
[OK]: INFO: HPE ProLiant BL460c Gen10 (CPU: 2, MEM: 64GB) - BIOS: I41 v1.46 (10/02/2018) - Serial: *** - Power: Off - Name: NOT SET
[OK]: Chassi 1 : All temp sensors (0) are in good condition
[OK]: Chassi enclosurechassis : All temp sensors (0) are in good condition

I've found some other servers though:

[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz) status is: None
[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz) status is: None
[OK]: All fans (8) are in good condition
[OK]: All 16 memory modules (Total 1024.0GB) are in good condition
[OK]: All network adapter (3) and ports (5) are in good condition
[OK]: All power supplies (2) are in good condition and 1 Voltages are OK
[OK]: One or more storage components report an issue
[OK]: INFO: Dell Inc. PowerEdge C6420 (CPU: 2, MEM: 1024GB) - BIOS: 2.11.2 - Serial: *** - ServiceTag: *** - Power: Off - Name: NOT SET - 32 health sensors are in 'OK' state
[OK]: All temp sensors (1) are in good condition
|'ps_1'=266 'ps_2'=27 'temp_Inlet_Temp'=21.0;43;47 'Fan_1A'=-2147483648;; 'Fan_1B'=-2147483648;; 'Fan_2A'=-2147483648;; 'Fan_2B'=-2147483648;; 'Fan_3A'=-2147483648;; 'Fan_3B'=-2147483648;; 'Fan_4A'=-2147483648;; 'Fan_4B'=-2147483648;; 

For the CPUs it's not working yet, and I also receive what looks like a negative integer overflow value (-2147483648) for the fans.

And on another server I have problems with the temp sensors when turned off:

[CRITICAL]: Temp sensor 01-Inlet Ambient status is: Offline (0.0 °C) (max: 42.0 °C)
[CRITICAL]: Temp sensor 02-CPU 1 status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 03-CPU 2 status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 04-P1 DIMM 1-6 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 05-P1 DIMM 7-12 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 06-P2 DIMM 1-6 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 07-P2 DIMM 7-12 status is: Offline (0.0 °C) (max: 89.0 °C)
[CRITICAL]: Temp sensor 08-HD Max status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 09-Exp Bay Drive status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 10-Chipset status is: Offline (0.0 °C) (max: 105.0 °C)
[CRITICAL]: Temp sensor 11-PS 1 Inlet status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 12-PS 2 Inlet status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 13-VR P1 status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 14-VR P2 status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 15-VR P1 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 16-VR P1 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 17-VR P2 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 18-VR P2 Mem status is: Offline (0.0 °C) (max: 115.0 °C)
[CRITICAL]: Temp sensor 19-PS 1 Internal status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 20-PS 2 Internal status is: Offline (0.0 °C) (max: N/A °C)
[CRITICAL]: Temp sensor 21-PCI 1 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 22-PCI 2 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 23-PCI 3 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 24-PCI 4 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 25-PCI 5 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 26-PCI 6 status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 27-HD Controller status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 28-LOM Card status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 29-LOM status is: Offline (0.0 °C) (max: 100.0 °C)
[CRITICAL]: Temp sensor 30-Front Ambient status is: Offline (0.0 °C) (max: 65.0 °C)
[CRITICAL]: Temp sensor 31-PCI 1 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 32-PCI 2 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 33-PCI 3 Zone. status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 34-PCI 4 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 35-PCI 5 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 36-PCI 6 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 37-HD Cntlr Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 38-I/O Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 39-P/S 2 Zone status is: Offline (0.0 °C) (max: 70.0 °C)
[CRITICAL]: Temp sensor 40-Battery Zone status is: Offline (0.0 °C) (max: 75.0 °C)
[CRITICAL]: Temp sensor 41-iLO Zone status is: Offline (0.0 °C) (max: 90.0 °C)
[CRITICAL]: Temp sensor 42-Rear HD Max status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 43-Storage Batt status is: Offline (0.0 °C) (max: 60.0 °C)
[CRITICAL]: Temp sensor 44-Fuse status is: Offline (0.0 °C) (max: 100.0 °C)
[OK]: BMC: iLO 4 (Firmware: iLO 4 v2.81) and all nics are in 'OK' state.
[OK]: All fans (6) are in good condition
[OK]: All 5 memory modules (Total 48.0GB) are in good condition
[OK]: All network adapter (1) and ports (4) are in good condition
[OK]: All power supplies (0) are in good condition
[OK]: All processors (1) are in good condition
[OK]: INFO: HPE ProLiant DL380 Gen9 (CPU: 1, MEM: 48GB) - BIOS: P89 v2.64 (10/17/2018) - Serial: *** - Power: Off - Name: ***

lgmu commented 1 year ago

Here I additionally receive a CRITICAL because of an Unknown battery status on the RAID controller:

[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.4 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Processor CPU.Socket.3 (Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz) status is: None
[CRITICAL]: Battery on RAID Controller in Slot 1 Status: Unknown
[OK]: BMC: iDRAC 9 (Firmware: 5.10.30.00) and all nics are in 'OK' state.
[OK]: Chassi has no fans installed/reported
[OK]: All 32 memory modules (Total 2048.0GB) are in good condition
[OK]: All network adapter (5) and ports (12) are in good condition
[OK]: All power supplies (2) are in good condition and Power redundancy 1 status is: Disabled and 1 Voltages are OK
[OK]: INFO: Dell Inc. PowerEdge R840 (CPU: 4, MEM: 2048GB) - BIOS: 2.14.2 - Serial: *** - ServiceTag: *** - Power: Off - Name: NOT SET - 57 health sensors are in 'OK' state
[OK]: All temp sensors (1) are in good condition

bb-Ricardo commented 1 year ago

uiui, this needs a more general approach then

can you check if the negative fan values are being sent directly from the iDRAC?
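One way to check this independently of the plugin is to look at the raw Thermal resource the BMC returns (in Redfish this usually lives under /redfish/v1/Chassis/<chassis id>/Thermal). The following is only a minimal sketch, assuming the resource has already been fetched and parsed into a dict. The function name and the sample payload are made up for illustration and are not part of check_redfish:

```python
# -2147483648 is the smallest signed 32-bit integer, which hints at a
# "no reading" placeholder rather than a measured fan speed.
INT32_MIN = -2**31


def raw_fan_readings(thermal):
    """Return (name, reading) pairs exactly as reported in a Thermal payload.

    `thermal` is the parsed JSON body of a Redfish Thermal resource.
    Hypothetical helper for illustration only.
    """
    readings = []
    for fan in thermal.get("Fans") or []:
        name = fan.get("Name") or fan.get("FanName") or fan.get("MemberId", "?")
        readings.append((name, fan.get("Reading")))
    return readings


# Hypothetical, heavily shortened payload for demonstration.
sample = {
    "Fans": [
        {"Name": "FAN1A", "Reading": INT32_MIN, "ReadingUnits": "RPM"},
        {"Name": "FAN1B", "Reading": 5400, "ReadingUnits": "RPM"},
    ]
}

for name, reading in raw_fan_readings(sample):
    note = " (looks like a placeholder, not a real RPM value)" if reading == INT32_MIN else ""
    print(f"{name}: {reading}{note}")
```

If the value shows up in the raw payload, the BMC is sending it and the plugin is only passing it through.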

lgmu commented 1 year ago

Yes, sorry I didn't check that before:

'Fans': [{'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Thermal#/Fans/0',
           '@odata.type': '#Thermal.v1_7_1.Fan',
           'Assembly': {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Assembly'},
           'FanName': 'FAN1A',
           'HotPluggable': False,
           'LowerThresholdCritical': None,
           'LowerThresholdFatal': None,
           'LowerThresholdNonCritical': None,
           'MaxReadingRange': None,
           'MemberId': '0',
           'MinReadingRange': None,
           'Name': 'FAN1A',
           'PhysicalContext': 'Fan',
           'Reading': -2147483648,
           'ReadingUnits': 'RPM',
           'Redundancy': [],
           'Redundancy@odata.count': 0,
           'RelatedItem': [{'@odata.id': '/redfish/v1/Chassis/System.Embedded.1'}],
           'RelatedItem@odata.count': 1,
           'SensorNumber': 56,
           'Status': {'Health': None, 'State': None},
           'UpperThresholdCritical': None,
           'UpperThresholdFatal': None,
           'UpperThresholdNonCritical': None},
          {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Thermal#/Fans/1',
           '@odata.type': '#Thermal.v1_7_1.Fan',
           'Assembly': {'@odata.id': '/redfish/v1/Chassis/System.Embedded.1/Assembly'},
           'FanName': 'FAN1B',
           'HotPluggable': False,
           'LowerThresholdCritical': None,
           'LowerThresholdFatal': None,
           'LowerThresholdNonCritical': None,
           'MaxReadingRange': None,
           'MemberId': '1',
           'MinReadingRange': None,
           'Name': 'FAN1B',
           'PhysicalContext': 'Fan',
           'Reading': -2147483648,
           'ReadingUnits': 'RPM',
           ...

bb-Ricardo commented 1 year ago

😄 Well, this is quite something. I should add some sanity checks on the returned values; if they are out of range, they should default to 0.

What do you think?
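A rough sketch of what such a sanity check could look like. This is not the code that ended up in check_redfish; the function name and the plausibility limits are assumptions chosen for illustration:

```python
# Assumed plausibility window for a fan speed in RPM (hypothetical limits).
MIN_FAN_RPM = 0
MAX_FAN_RPM = 100_000


def sanitize_reading(value, low=MIN_FAN_RPM, high=MAX_FAN_RPM, default=0):
    """Return `value` if it is a number within [low, high], else `default`."""
    # Reject None, strings, booleans and anything else that is not a number.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return default
    # NaN is the only float that is not equal to itself.
    if value != value:
        return default
    if value < low or value > high:
        return default
    return value


print(sanitize_reading(-2147483648))  # out of range, falls back to the default
print(sanitize_reading(5400))         # plausible, passed through unchanged
```

Defaulting to 0 keeps the performance data numeric, at the cost of no longer distinguishing "fan stopped" from "no data".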

lgmu commented 1 year ago

Sounds good!

weeboo commented 1 year ago

This is better, but I have the same problem with the --proc check when the system is powered off. My server is a Dell PowerEdge R640:

[CRITICAL]: Processor CPU.Socket.1 (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz) status is: None
[CRITICAL]: Processor CPU.Socket.2 (Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz) status is: None

bb-Ricardo commented 1 year ago

I will fix it in the next version, but it won't be until after the Easter holiday break.

bb-Ricardo commented 1 year ago

Hey @weeboo, @lgmu,

I just pushed another commit to next-release.

Would you mind testing it?

lgmu commented 1 year ago

Hey, works great now. Thanks!

One thing I've noticed (on hosts that are Power: On):

Sometimes I randomly get [CRITICAL]: Processor Proc 1 (Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz) status is: None, but on the next reschedule it's OK again. When I check the logs, I don't see that this has happened before. I'll need to keep an eye on this and will give you feedback, but I don't see any change in the code that would have changed this behaviour.

bb-Ricardo commented 1 year ago

Good to hear. But this behavior only occurs with the latest change? And not with older versions?

lgmu commented 1 year ago

I don't know yet, I couldn't reproduce it on the command line - I'll check after the weekend

weeboo commented 1 year ago

Now it's OK for --proc, but the problem is now with --power:

[CRITICAL]: Power supply 1 (PWR SPLY,1100W,RDNT,LTON) status is: None
[CRITICAL]: Power supply 2 (PWR SPLY,1100W,RDNT,LTON) status is: None

bb-Ricardo commented 1 year ago

Hi,

I deliberately left out the power supply section. This should be monitored correctly by the BMC even if the server is switched off. I assume it would be important to know if a power supply fails while the server is in standby.

What do you think?

weeboo commented 1 year ago

The BMC says the status is None, but you map it to CRITICAL. I think that's not consistent. In my opinion, None could be assigned to the UNKNOWN status. In this case the system is powered off, so every None status could be ignored.
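To make the two positions concrete, here is a small sketch of one possible mapping. It is not how check_redfish implements it; `map_health` and the `always_monitored` flag are hypothetical names. It assumes the Redfish Health values "OK", "Warning" and "Critical", the PowerState value "Off", and the usual Nagios/Icinga exit codes:

```python
from enum import IntEnum


class PluginState(IntEnum):
    # Conventional Nagios/Icinga plugin exit codes.
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


# Redfish Status.Health values mapped to plugin states.
HEALTH_TO_STATE = {
    "OK": PluginState.OK,
    "Warning": PluginState.WARNING,
    "Critical": PluginState.CRITICAL,
}


def map_health(health, power_state, always_monitored=False):
    """Map a Redfish health value to a plugin state (illustrative only).

    A missing or unrecognised health value is ignored (OK) while the system
    is powered off, unless the component is flagged as `always_monitored`
    (for example a power supply). Otherwise it becomes UNKNOWN.
    """
    if health in HEALTH_TO_STATE:
        return HEALTH_TO_STATE[health]
    if power_state == "Off" and not always_monitored:
        return PluginState.OK
    return PluginState.UNKNOWN


print(map_health(None, "Off").name)                         # CPU on a powered-off host
print(map_health(None, "Off", always_monitored=True).name)  # power supply in standby
```

With this mapping a None status never turns into CRITICAL on its own, while components that should report in standby still surface as UNKNOWN.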

lgmu commented 1 year ago

Good to hear. But this behavior only occurs with the latest change? And not with older versions?

Seems to be fine, didn't see any more Criticals

bb-Ricardo commented 1 year ago

Hi,

I just pushed another commit regarding status of power supply if server is switched off. Can you try again please?

bb-Ricardo commented 1 year ago

@weeboo, @lgmu: any chance of testing this commit?

weeboo commented 1 year ago

I will try today

weeboo commented 1 year ago

All seems to be fine now, thanks!

bb-Ricardo commented 1 year ago

Thank you for testing, then I will close this issue.