bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

HP ProLiant DL360 Gen10 storage cache handling #122

Closed: Decstasy closed this issue 6 months ago

Decstasy commented 8 months ago

Hello,

first of all, thank you for this great open source project :)

I have a problem regarding the storage battery cache status. At the moment it looks like this:

[WARNING]: Smart Array controller cache (2048MB)  status: WARNING (health information missing)
[OK]: HPE Smart Array P408i-a SR Gen10 (FW: 1.04) status is: OK
[OK]: Physical Drive (1I:1:1) 960GB status: OK
[OK]: Physical Drive (1I:1:2) 960GB status: OK
[OK]: Logical Drive (0:1) 960.2GB (RAID 1) status: OK
[OK]: StorageEnclosure (1I:1) status: OK
[OK]: StorageEnclosure (2I:0) status: OK
[OK]: SmartStorageBattery 1 (charge level: 100%, capacity: 96W) status: OK

After looking at the raw responses and the code, my first conclusion was that HP had changed the key name:

--- a/cr_module/storage.py
+++ b/cr_module/storage.py
@@ -400,7 +400,7 @@ def get_storage_hpe(system):
                     firmware=grab(controller_response, "FirmwareVersion.Current.VersionString"),
                     serial=controller_response.get("SerialNumber"),
                     location=controller_response.get("Location"),
-                    backup_power_health=grab(controller_response, "CacheModuleStatus.Health"),
+                    backup_power_health=grab(controller_response, "ControllerBoard.Status.Health"),
                     backup_power_present=backup_power_present,
                     cache_size_in_mb=controller_response.get("CacheMemorySizeMiB"),
                     system_ids=system_id

That changed the status accordingly and the problem was gone...

BUT later I realized that my first "fix" might simply be wrong. In the HTTP responses, the cache battery status, charge level, etc. are reported under /redfish/v1/Chassis/1, so I'm a bit confused about why that is.

I'm willing to fix it, but I still lack a deeper understanding of this, and I don't want to introduce a bug.

Here are the three relevant responses, lightly redacted:

{
    "@odata.context": "/redfish/v1/$metadata#HpeSmartStorage.HpeSmartStorage",
    "@odata.etag": "W/REDACTED",
    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/",
    "@odata.type": "#HpeSmartStorage.v2_0_0.HpeSmartStorage",
    "Description": "HPE Smart Storage",
    "Id": "SmartStorage",
    "Links": {
        "ArrayControllers": {
            "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/"
        },
        "HostBusAdapters": {
            "@odata.id": "/redfish/v1/Systems/1/SmartStorage/HostBusAdapters/"
        }
    },
    "Name": "HpeSmartStorage",
    "Status": {
        "Health": "OK"
    }
}
{
    "@odata.context": "/redfish/v1/$metadata#HpeSmartStorageArrayControllerCollection.HpeSmartStorageArrayControllerCollection",
    "@odata.etag": "W/REDACTED",
    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/",
    "@odata.type": "#HpeSmartStorageArrayControllerCollection.HpeSmartStorageArrayControllerCollection",
    "Description": "HPE Smart Storage Array Controllers View",
    "Members": [
        {
            "@odata.context": "/redfish/v1/$metadata#HpeSmartStorageArrayController.HpeSmartStorageArrayController",
            "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/0/",
            "@odata.type": "#HpeSmartStorageArrayController.v2_1_0.HpeSmartStorageArrayController",
            "AdapterType": "SmartArray",
            "BackupPowerSourceStatus": "Present",
            "CacheMemorySizeMiB": 2048,
            "CacheModuleSerialNumber": "               ",
            "ControllerBoard": {
                "Status": {
                    "Health": "OK"
                }
            },
            "ControllerPartNumber": "836260-001",
            "CurrentOperatingMode": "Mixed",
            "Description": "HPE Smart Storage Array Controller View",
            "DriveWriteCache": "Disabled",
            "EncryptionCryptoOfficerPasswordSet": false,
            "EncryptionCspTestPassed": true,
            "EncryptionEnabled": false,
            "EncryptionFwLocked": false,
            "EncryptionHasLockedVolumesMissingBootPassword": false,
            "EncryptionMixedVolumesEnabled": false,
            "EncryptionSelfTestPassed": true,
            "EncryptionStandaloneModeEnabled": false,
            "ExternalPortCount": 0,
            "FirmwareVersion": {
                "Current": {
                    "VersionString": "1.04"
                }
            },
            "HardwareRevision": "B",
            "Id": "0",
            "InternalPortCount": 2,
            "Links": {
                "LogicalDrives": {
                    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/0/LogicalDrives/"
                },
                "PhysicalDrives": {
                    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/0/DiskDrives/"
                },
                "StorageEnclosures": {
                    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/0/StorageEnclosures/"
                },
                "UnconfiguredDrives": {
                    "@odata.id": "/redfish/v1/Systems/1/SmartStorage/ArrayControllers/0/UnconfiguredDrives/"
                }
            },
            "Location": "Slot 0",
            "LocationFormat": "PCISlot",
            "Model": "HPE Smart Array P408i-a SR Gen10",
            "Name": "HpeSmartStorageArrayController",
            "ReadCachePercent": 10,
            "SerialNumber": "REDACTED ",
            "Status": {
                "Health": "OK",
                "State": "Enabled"
            },
            "WriteCacheBypassThresholdKB": 1040,
            "WriteCacheWithoutBackupPowerEnabled": false
        }
    ],
    "Members@odata.count": 1,
    "Name": "HpeSmartStorageArrayControllers"
}
{
    "@odata.context": "/redfish/v1/$metadata#Chassis.Chassis",
    "@odata.etag": "W/REDACTED",
    "@odata.id": "/redfish/v1/Chassis/1/",
    "@odata.type": "#Chassis.v1_2_0.Chassis",
    "ChassisType": "RackMount",
    "Id": "1",
    "Links": {
        "ComputerSystems": [
            {
                "@odata.id": "/redfish/v1/Systems/1/"
            }
        ],
        "ManagedBy": [
            {
                "@odata.id": "/redfish/v1/Managers/1/"
            }
        ]
    },
    "Manufacturer": "HPE",
    "Model": "ProLiant DL360 Gen10",
    "Name": "Computer System Chassis",
    "NetworkAdapters": {
        "@odata.id": "/redfish/v1/Chassis/1/NetworkAdapters/"
    },
    "Oem": {
        "Hpe": {
            "@odata.context": "/redfish/v1/$metadata#HpeServerChassis.HpeServerChassis",
            "@odata.type": "#HpeServerChassis.v2_1_0.HpeServerChassis",
            "Actions": {
                "#HpeServerChassis.DisableMCTPOnServer": {
                    "target": "/redfish/v1/Chassis/1/Actions/Oem/Hpe/HpeServerChassis.DisableMCTPOnServer/"
                },
                "#HpeServerChassis.FactoryResetMCTP": {
                    "target": "/redfish/v1/Chassis/1/Actions/Oem/Hpe/HpeServerChassis.FactoryResetMCTP/"
                }
            },
            "Firmware": {
                "PlatformDefinitionTable": {
                    "Current": {
                        "VersionString": "2.36.0 Build 10"
                    }
                },
                "PowerManagementController": {
                    "Current": {
                        "VersionString": "1.0.4"
                    }
                },
                "PowerManagementControllerBootloader": {
                    "Current": {
                        "Family": "25",
                        "VersionString": "1.1"
                    }
                },
                "SPSFirmwareVersionData": {
                    "Current": {
                        "VersionString": "4.0.4.288"
                    }
                },
                "SystemProgrammableLogicDevice": {
                    "Current": {
                        "VersionString": "0x2A"
                    }
                }
            },
            "Links": {
                "Devices": {
                    "@odata.id": "/redfish/v1/Chassis/1/Devices/"
                }
            },
            "MCTPEnabledOnServer": true,
            "SmartStorageBattery": [
                {
                    "ChargeLevelPercent": 100,
                    "FirmwareVersion": "2.1",
                    "Index": 1,
                    "MaximumCapWatts": 96,
                    "Model": "727258-B21",
                    "ProductName": "HPE Smart Storage Battery ",
                    "RemainingChargeTimeSeconds": 0,
                    "SerialNumber": "REDACTED",
                    "SparePartNumber": "871264-001",
                    "Status": {
                        "Health": "OK",
                        "State": "Enabled"
                    }
                }
            ],
            "SystemMaintenanceSwitches": {}
        }
    },
    "Power": {
        "@odata.id": "/redfish/v1/Chassis/1/Power/"
    },
    "SKU": "867959-B21",
    "SerialNumber": "REDACTED",
    "Status": {
        "Health": "OK",
        "State": "Enabled"
    },
    "Thermal": {
        "@odata.id": "/redfish/v1/Chassis/1/Thermal/"
    }
}
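
For reference, the battery data in the chassis response above sits under Oem.Hpe.SmartStorageBattery. The following is only a minimal sketch of how it could be read from an already-parsed response; the function name battery_statuses is made up for illustration and is not part of check_redfish, and the sample dict is a trimmed copy of the response shown above.

```python
from typing import Any, Dict, List, Optional, Tuple


def battery_statuses(chassis: Dict[str, Any]) -> List[Tuple[Optional[int], Optional[str]]]:
    """Return (Index, Health) for each entry in Oem.Hpe.SmartStorageBattery.

    Hypothetical helper for illustration only; missing keys yield an
    empty list or None values instead of raising.
    """
    hpe = (chassis.get("Oem") or {}).get("Hpe") or {}
    batteries = hpe.get("SmartStorageBattery") or []
    return [
        (battery.get("Index"), (battery.get("Status") or {}).get("Health"))
        for battery in batteries
    ]


# Trimmed copy of the /redfish/v1/Chassis/1/ response from this issue.
chassis_response = {
    "Oem": {
        "Hpe": {
            "SmartStorageBattery": [
                {
                    "ChargeLevelPercent": 100,
                    "Index": 1,
                    "MaximumCapWatts": 96,
                    "Status": {"Health": "OK", "State": "Enabled"},
                }
            ]
        }
    }
}

print(battery_statuses(chassis_response))  # [(1, 'OK')]
```

This only shows where the battery health lives in the chassis response. It does not answer whether the controller's cache status should be derived from it.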

It would be nice if you could give me some advice and background information to understand this behaviour.

Thank you and best regards, Dennis

bb-Ricardo commented 8 months ago

Hi,

thank you for reporting this issue. This would be quite easy to fix while keeping backward compatibility:

                    backup_power_health=grab(controller_response, "CacheModuleStatus.Health") or grab(controller_response, "ControllerBoard.Status.Health"),

But I'm not sure if CacheModuleStatus is actually the same as ControllerBoard. In my test data it looks like this:

    "CacheModuleStatus": {
        "Health": "OK"
    },
    "ControllerBoard": {
        "Status": {
            "Health": "OK"
        }
    },

It could be just a bug in the iLO or controller firmware. Do you have other examples in your company with similar configurations but different firmware versions?
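
To make the fallback idea concrete, here is a rough standalone sketch. It assumes a dotted-path lookup similar in spirit to the grab call above; get_path and cache_health are hypothetical names written for this example, not the project's actual helpers. It also returns which key supplied the value, so it stays visible when the result comes from ControllerBoard rather than CacheModuleStatus.

```python
from typing import Any, Optional, Tuple


def get_path(data: Any, dotted_path: str) -> Optional[Any]:
    """Walk nested dicts along a dotted path; return None if a key is missing.

    Hypothetical stand-in for a dotted-path lookup, for illustration only.
    """
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def cache_health(controller: dict) -> Tuple[Optional[str], Optional[str]]:
    """Return (health, source_path), preferring CacheModuleStatus.

    Falling back to ControllerBoard is an assumption: it is not confirmed
    that both keys describe the same component.
    """
    for path in ("CacheModuleStatus.Health", "ControllerBoard.Status.Health"):
        health = get_path(controller, path)
        if health:
            return health, path
    return None, None


# Shape of the test data quoted above: both keys are present.
test_data = {
    "CacheModuleStatus": {"Health": "OK"},
    "ControllerBoard": {"Status": {"Health": "OK"}},
}

# Shape of the DL360 Gen10 response from this issue: no CacheModuleStatus.
gen10 = {
    "BackupPowerSourceStatus": "Present",
    "CacheMemorySizeMiB": 2048,
    "ControllerBoard": {"Status": {"Health": "OK"}},
}

print(cache_health(test_data))  # ('OK', 'CacheModuleStatus.Health')
print(cache_health(gen10))      # ('OK', 'ControllerBoard.Status.Health')
```

Reporting the source path alongside the health value would make it easier to compare systems with different firmware versions.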

Decstasy commented 8 months ago

Unfortunately we don't have a server with the very same configuration. There is a server with similar hardware but mismatching firmware versions, although one of these servers has encountered the same problem. We will try to install the newest SPP, since the other servers look fine so far.

Regarding the fix for backward compatibility, I will try to ask HP support about this matter. I was not able to confirm from their Redfish API docs that CacheModuleStatus is actually the same as ControllerBoard.

I'll keep you updated, but this will take some time. Downtimes etc...

Have a great week, Dennis

bb-Ricardo commented 8 months ago

Hi @Decstasy

Thank you for checking it out. Please keep me updated and share any news.

Cheers Ricardo

bb-Ricardo commented 6 months ago

Hi, any updates? Otherwise it would be great if we could close this issue.