Closed matejzero closed 3 years ago
It looks like the new BMC reports 2 chassis for some reason:
{
"Members": [
{
"@odata.id": "/redfish/v1/Chassis/1"
},
{
"@odata.id": "/redfish/v1/Chassis/3"
}
],
"@odata.type": "#ChassisCollection.ChassisCollection",
"@odata.id": "/redfish/v1/Chassis",
"Name": "ChassisCollection",
"@odata.etag": "\"234145c889472ae2565\"",
"Members@odata.count": 2,
"Description": "A collection of Chassis resource instances."
}
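For reference, a minimal sketch of how the member URIs could be pulled out of a collection document like the one above. The function names `chassis_member_uris` and `sub_resource_uris` are made up for illustration and are not taken from check_redfish; the JSON is an abbreviated copy of the output above.

```python
import json

# Abbreviated copy of the ChassisCollection document posted above.
CHASSIS_COLLECTION = json.loads("""
{
  "Members": [
    {"@odata.id": "/redfish/v1/Chassis/1"},
    {"@odata.id": "/redfish/v1/Chassis/3"}
  ],
  "Members@odata.count": 2
}
""")


def chassis_member_uris(collection):
    """Return the @odata.id of every collection member that carries one."""
    return [
        member["@odata.id"]
        for member in collection.get("Members", [])
        if "@odata.id" in member
    ]


def sub_resource_uris(collection, sub_resource):
    """Build e.g. the Thermal or Power URI for every chassis in the collection."""
    return [
        f"{uri.rstrip('/')}/{sub_resource}"
        for uri in chassis_member_uris(collection)
    ]


for uri in sub_resource_uris(CHASSIS_COLLECTION, "Thermal"):
    print(uri)
```

With two members in the collection, any check that walks every chassis will also request Thermal and Power for chassis 3.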
Chassis 1 reports all info, but chassis 3 output looks like so:
{
"SerialNumber": "xxxxxxxx",
"Id": "3",
"Name": "Backplane",
"@odata.id": "/redfish/v1/Chassis/3",
"SKU": "01GV280",
"Oem": {
"Lenovo": {
"PRODUCT_ID": "0000",
"VPD_ID": "0070",
"Entity_ID": "0f",
"Device_ID": "51",
"POS_ID": "006a"
}
},
"@odata.type": "#Chassis.v1_10_0.Chassis",
"ChassisType": "Enclosure",
"PartNumber": "SC57A01986",
"@odata.etag": "\"32915858356a2a24fc8\"",
"Manufacturer": "LNVO",
"Description": "This resource is used to represent a chassis or other physical enclosure for a Redfish implementation."
}
Looking at the output, this is a backplane / enclosure resource, which might provide more info in later versions, but for now there is not much data here.
Looking at the changelog I found this regarding the chassis: Added the Redfish support of Enclosure "Chassis" object on blade and dense systems.
oh wow. interesting. I might have to adapt the plugin.
I overlooked that the temperature check is also reporting the same problem.
Yes, chassis 1 contains all necessary data. ChassisType of chassis 1 is RackMount. Maybe a quick workaround would be to check if ChassisType is RackMount, but I'm not sure if that causes problems for other types of servers.
I would take another approach. I would collect data from every chassis and only complain if no temperature (or other) data is returned at all. As long as one chassis returns data, it should report as green.
Will have a look at it.
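A rough sketch of the idea described above, using fans as the example. This is not the plugin's actual code: `collect_fans`, `fan_check` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the parsed JSON document for a URI, or None when the resource does not exist. Evaluating the health of each fan is left out on purpose.

```python
def collect_fans(chassis_uris, fetch):
    """Gather fan entries from every chassis; remember which ones had none."""
    fans = []
    chassis_without_fans = []
    for uri in chassis_uris:
        thermal = fetch(f"{uri}/Thermal") or {}
        chassis_fans = thermal.get("Fans") or []
        if chassis_fans:
            fans.extend(chassis_fans)
        else:
            chassis_without_fans.append(uri)
    return fans, chassis_without_fans


def fan_check(chassis_uris, fetch):
    """Report UNKNOWN only when no chassis at all returned fan data."""
    fans, _ = collect_fans(chassis_uris, fetch)
    if not fans:
        return "UNKNOWN", "No fan data returned by any chassis"
    return "OK", f"Found {len(fans)} fans"


# Hypothetical responses: chassis 1 has fans, chassis 3 has no Thermal resource.
fake_bmc = {
    "/redfish/v1/Chassis/1/Thermal": {
        "Fans": [{"Name": "Fan 1A Tach"}, {"Name": "Fan 1B Tach"}]
    },
}
print(fan_check(["/redfish/v1/Chassis/1", "/redfish/v1/Chassis/3"], fake_bmc.get))
```

A chassis without a Thermal resource is simply skipped, so a backplane-only chassis no longer turns the whole check UNKNOWN.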
That is probably a better approach.
In case the backplane endpoint starts returning temperature / power / fans (blade or dense systems where there is a separate storage chassis under the server bay), your solution will cover that.
Looking at Lenovo SD530, it might be that this reports chassis 1 as main chassis (power supply and some temperatures) and 2,3,4,5 as each node (fans, power, temperature)...
I would be highly interested in mockups to add to my testing environment. Would be great if you could provide some. Also makes coding against it much easier.
I can't provide mockups for SD530 as we don't have them. As for SR6x0 and BMC/XCC 5.40, the mockup is above (for chassis 3).
I would also love to test any fixes you make.
So, it took me quite a while, but I have now tried to take care of this issue.
Can you check out next-release and see if this fixes your issue?
Power supply checks work OK:
[OK]: All power supplies (2) are in good condition and Power redundancy 1 status is: Enabled|'ps_1'=99 'ps_2'=99
Temperature checks work OK:
[OK]: |'temp_Ambient_Temp'=26.0;43;47 'temp_CPU1_Temp'=48.0 'temp_CPU1_DTS'=-49.0 'temp_DIMM_3_Temp'=39.0 'temp_DIMM_4_Temp'=39.0 'temp_DIMM_5_Temp'=39.0 'temp_DIMM_6_Temp'=39.0 'temp_DIMM_7_Temp'=37.0 'temp_DIMM_8_Temp'=37.0 'temp_DIMM_9_Temp'=36.0 'temp_DIMM_10_Temp'=36.0 'temp_PCH_Temp'=64.0 'temp_Exhaust_Temp'=48.0
Fans:
[UNKNOWN]: Request error: No fan data returned for API URL '/redfish/v1/Chassis/1/Thermal', No fan data returned for API URL '/redfish/v1/Chassis/3/Thermal'
Chassis/1/ json output: https://pastebin.com/2UqTtiAg Chassis/3/ json output: https://pastebin.com/6sgN2F81
I ran the fans check again and now it works, but the output is different.
Check on old version:
[OK]: All fans (10) are in good condition|'Fan_Fan_1A_Tach'=5460;; 'Fan_Fan_1B_Tach'=5340;; 'Fan_Fan_2A_Tach'=5376;; 'Fan_Fan_2B_Tach'=5251;; 'Fan_Fan_3A_Tach'=5376;; 'Fan_Fan_3B_Tach'=5162;; 'Fan_Fan_4A_Tach'=5460;; 'Fan_Fan_4B_Tach'=5251;; 'Fan_Fan_5A_Tach'=5208;; 'Fan_Fan_5B_Tach'=5162;;
Check on new version:
[OK]: |'Fan_Fan_1A_Tach'=5124;; 'Fan_Fan_1B_Tach'=4895;; 'Fan_Fan_2A_Tach'=5208;; 'Fan_Fan_2B_Tach'=4895;; 'Fan_Fan_3A_Tach'=5040;; 'Fan_Fan_3B_Tach'=4984;; 'Fan_Fan_4A_Tach'=5040;; 'Fan_Fan_4B_Tach'=4895;; 'Fan_Fan_5A_Tach'=5040;; 'Fan_Fan_5B_Tach'=4806;;
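To make the difference between the two outputs easier to see, here is a small sketch that splits a plugin output line into status, summary text and performance data. `split_plugin_output` is a helper written for this comparison only, and the two sample lines are shortened versions of the outputs above.

```python
import re

# Matches one perfdata item such as 'Fan_Fan_1A_Tach'=5460;;
PERF_ITEM = re.compile(r"'([^']+)'=([^;\s]+)")


def split_plugin_output(line):
    """Return (status, summary text, {perfdata label: value})."""
    text, _, perfdata = line.partition("|")
    status, _, summary = text.partition(":")
    return status.strip(), summary.strip(), dict(PERF_ITEM.findall(perfdata))


old = "[OK]: All fans (10) are in good condition|'Fan_Fan_1A_Tach'=5460;; 'Fan_Fan_1B_Tach'=5340;;"
new = "[OK]: |'Fan_Fan_1A_Tach'=5124;; 'Fan_Fan_1B_Tach'=4895;;"

old_status, old_summary, old_perf = split_plugin_output(old)
new_status, new_summary, new_perf = split_plugin_output(new)

print("same labels:", old_perf.keys() == new_perf.keys())
print("old summary:", repr(old_summary))
print("new summary:", repr(new_summary))
```

The perfdata labels are the same in both versions; only the summary text in front of the pipe is missing in the new one.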
Thank you for testing it. will check it out
Edit: I found the problem. This will cause a much bigger change than I anticipated. But in the end we will be able to support multiple chassis, systems and managers in every server/blade center.
That sounds great! Can't wait to test it out :)
Hey @matejzero,
It took quite a while but finally finished the change. Can you please test the 'next-release' branch and let me know if this works for you?
Thank you.
I can confirm the new version works on Lenovo SR630/SR650 with XCC firmware versions 5.42 (latest) and 4.80 (pre-latest), apart from no mel/sel logs, but that doesn't work on latest release either:
[UNKNOWN]: No log services discovered where name matches 'Manager'
[UNKNOWN]: No log services discovered where name matches 'System'
All checks are green on Dell R6515 (iDrac 4.10.10.10 and 4.30.30.30) and R640 (iDrac 4.10.10.10), but I get some errors on a R740 (iDrac 4.22.00.53) that weren't present in latest release:
storage check
New version: [CRITICAL]: PERC H730P Mini status: OK
Old version: [OK]: All storage controllers (PERC H730P Mini PERC H730P Mini, C620 Series Chipset Family SSATA Controller [AHCI mode] C620 Series Chipset Family SSATA Controller [AHCI mode], C620 Series Chipset Family SATA Controller [AHCI mode] C620 Series Chipset Family SATA Controller [AHCI mode], PERC H730P Mini), volumes and disk drives are in good condition
info check
New version: [CRITICAL]: Type: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 512GB) - BIOS: 2.9.4 - Serial: xxxx - Power: On - Name: NOT SET - 1 health sensor in 'CRITICAL' state, 34 health sensors are in 'OK' state
Old version: [OK]: Type: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 512GB) - BIOS: 2.9.4 - Serial: xxxx - Power: On - Name: NOT SET
I only have one R740 to test, but iDrac is reporting the system is all green. I tried looking through the verbose info output for any HealthState reported as Critical, but everything is OK or Unknown. Let me know how I can help debug the issue further to make it simpler for you.
Thanks for fixing the check so far!
I can confirm the new version works on Lenovo SR630/SR650 with XCC firmware versions 5.42 (latest) and 4.80 (pre-latest), apart from no mel/sel logs, but that doesn't work on latest release either:
[UNKNOWN]: No log services discovered where name matches 'Manager'
[UNKNOWN]: No log services discovered where name matches 'System'
If you could provide me with a mockup, I can check and integrate this as well.
All checks are green on Dell R6515 (iDrac 4.10.10.10 and 4.30.30.30) and R640 (iDrac 4.10.10.10), but I get some errors on a R740 (iDrac 4.22.00.53) that weren't present in latest release:
- storage check New version:
[CRITICAL]: PERC H730P Mini status: OK
Old version:[OK]: All storage controllers (PERC H730P Mini PERC H730P Mini, C620 Series Chipset Family SSATA Controller [AHCI mode] C620 Series Chipset Family SSATA Controller [AHCI mode], C620 Series Chipset Family SATA Controller [AHCI mode] C620 Series Chipset Family SATA Controller [AHCI mode], PERC H730P Mini), volumes and disk drives are in good condition
This seems to be a bug.
- info check New version:
[CRITICAL]: Type: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 512GB) - BIOS: 2.9.4 - Serial: xxxx - Power: On - Name: NOT SET - 1 health sensor in 'CRITICAL' state, 34 health sensors are in 'OK' state
Old version:[OK]: Type: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 512GB) - BIOS: 2.9.4 - Serial: xxxx - Power: On - Name: NOT SET
There seems to be one component not filtered properly.
Can you please run both commands with the --detailed option and post the output here?
Thank you.
I'll try and get the mockup for logs, but I need to find out which endpoint URI the script is calling to collect the document. If you can give me the URI (so that I won't need to look through verbose output), I'll be able to generate it quicker.
Detailed output of storage check:
[CRITICAL]: PERC H730P Mini status: OK
[OK]: PERC H730P Mini PERC H730P Mini (FW: 25.5.7.0005) status is: OK
[OK]: Physical Drive Solid State Disk 0:1:0 (MTFDDAK480TDC / SSD / SATA) 479.56GiB status: OK
[OK]: Physical Drive Solid State Disk 0:1:1 (MTFDDAK480TDC / SSD / SATA) 479.56GiB status: OK
[OK]: Logical Drive VD_0 (VD_0) 480GiB (Mirrored) status: OK
[OK]: StorageEnclosure BP14G+ 0:1 (Power: On) status: OK
[OK]: C620 Series Chipset Family SSATA Controller [AHCI mode] C620 Series Chipset Family SSATA Controller [AHCI mode] (FW: None) status is: None
[OK]: C620 Series Chipset Family SATA Controller [AHCI mode] C620 Series Chipset Family SATA Controller [AHCI mode] (FW: None) status is: None
[OK]: MICRON Solid State Disk 0:1:0 MTFDDAK480TDC (size: 479.56 GiB) status: OK
[OK]: MICRON Solid State Disk 0:1:1 MTFDDAK480TDC (size: 479.56 GiB) status: OK
[OK]: DELL Backplane 1 on Connector 0 of Integrated RAID Controller 1 BP14G+ 0:1 status: OK
info check:
[CRITICAL]: Type: Dell Inc. PowerEdge R740 (CPU: 1, MEM: 512GB) - BIOS: 2.9.4 - Serial: xxxx - Power: On - Name: NOT SET
[CRITICAL]: Sensor "CPU2 Status": Unknown (Enabled/Unknown)
[OK]: Sensor "CPU1 FIVR PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM012 VDDQ PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM012 VPP PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM012 VTT PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM345 VDDQ PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM345 VPP PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 MEM345 VTT PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 Status": OK (Enabled/Good)
[OK]: Sensor "CPU1 VCCIO PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 VCORE PG": OK (Enabled/Good)
[OK]: Sensor "CPU1 VSA PG": OK (Enabled/Good)
[OK]: Sensor "DIMM SLOT A10": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A11": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A2": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A4": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A5": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A7": OK (Enabled/Presence Detected)
[OK]: Sensor "DIMM SLOT A8": OK (Enabled/Presence Detected)
[OK]: Sensor "System Board 1.8V SW PG": OK (Enabled/Good)
[OK]: Sensor "System Board 2.5V SW PG": OK (Enabled/Good)
[OK]: Sensor "System Board 3.3V B PG": OK (Enabled/Good)
[OK]: Sensor "System Board 5V SW PG": OK (Enabled/Good)
[OK]: Sensor "System Board BP0 PG": OK (Enabled/Good)
[OK]: Sensor "System Board BP1 PG": OK (Enabled/Good)
[OK]: Sensor "System Board BP2 PG": OK (Enabled/Good)
[OK]: Sensor "System Board CMOS Battery": OK (Enabled/Good)
[OK]: Sensor "System Board DIMM PG": OK (Enabled/Good)
[OK]: Sensor "System Board Intrusion": OK (Enabled/No Breach)
[OK]: Sensor "System Board NDC PG": OK (Enabled/Good)
[OK]: Sensor "System Board PS1 PG FAIL": OK (Enabled/Good)
[OK]: Sensor "System Board PS2 PG FAIL": OK (Enabled/Good)
[OK]: Sensor "System Board PVNN SW PG": OK (Enabled/Good)
[OK]: Sensor "System Board VSB11 SW PG": OK (Enabled/Good)
[OK]: Sensor "System Board VSBM SW PG": OK (Enabled/Good)
I can see in the info output that there is Enabled/Unknown for sensor CPU2 Status. This server supports 2 CPUs, but only 1 is installed.
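One possible way to handle such a sensor, sketched under the assumption that an unknown health value means "nothing to evaluate" (for example an empty CPU socket) rather than a failure. This is only an illustration and not necessarily how the plugin filters it; `STATUS_BY_HEALTH`, `sensor_status` and `summarize` are made-up names, and the sensor list is a shortened version of the detailed output above.

```python
# Health values that map to a plugin status; anything else is skipped (assumption).
STATUS_BY_HEALTH = {"OK": "OK", "Warning": "WARNING", "Critical": "CRITICAL"}


def sensor_status(health):
    """Return the plugin status for a health value, or None to skip the sensor."""
    return STATUS_BY_HEALTH.get(health)


def summarize(sensors):
    """Count sensors per plugin status, ignoring sensors without a usable health."""
    counts = {}
    for _name, health in sensors:
        status = sensor_status(health)
        if status is None:
            continue
        counts[status] = counts.get(status, 0) + 1
    return counts


sensors = [
    ("CPU1 Status", "OK"),
    ("CPU2 Status", "Unknown"),  # socket not populated on this server
    ("System Board CMOS Battery", "OK"),
]
print(summarize(sensors))
```

Whether skipping is the right call depends on the vendor: on other hardware an unknown health value might deserve at least a warning.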
There is a Redfish mockup generator on GitHub. I usually use that one: https://github.com/DMTF/Redfish-Mockup-Creator
Could I send the mockup to you via email? I don't want to post it here because it includes serial numbers.
Absolutely
I don't seem to find your email on github. Could you send me an email to xxx at yyy and I'll send you the link to mockup files.
You can find it here: https://github.com/bb-Ricardo/check_redfish/blob/master/check_redfish.py#L21
I saw you made some commits. Checks on R740 now pass without a problem.
Thank you for testing. I just pushed another commit.
Now the Logs on Lenovo Systems should work again. Also added Controller Cache Battery Infos for newer DELL and Lenovo Systems
I tested latest version on SR630/SR650 and Dell R6515, R640 and R740 and all works OK!
We also have a lot of SR635 servers, but they are too slow for querying at the moment. Need to do more testing, but just querying base redfish URI can take between 3s and 30s+, so I need to do more testing and then try this check.
Anyway, I think all issues are now fixed and you can close this.
Thank you very much for these fixes!
Great and thank you for all the testing.
Today I upgraded some Lenovo SR630 servers to latest BMC (5.40 (Build ID: CDI364M)).
After the upgrade, power and fans checks started to fail.
Errors in v1.0.0:
[UNKNOWN]: got error 'ExtendedError.1.1.RequestUriNotFound' for API path '[]'
Errors in v1.1.0:
[UNKNOWN]: got error 'ExtendedError.1.1.RequestUriNotFound/The request specified a URI of a resource that does not exist.' for API path '/redfish/v1/Chassis/3/Thermal'
[UNKNOWN]: got error 'ExtendedError.1.1.RequestUriNotFound/The request specified a URI of a resource that does not exist.' for API path '/redfish/v1/Chassis/3/Power'
I'll dig a bit deeper and try to get some more info back.