bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

UNKNOWN issues after 1.3.0->1.4.1 update #96

Closed gabortakacs78 closed 1 year ago

gabortakacs78 commented 1 year ago

Hi,

once we updated the plugin to the latest version, we started getting a lot of UNKNOWN messages on our servers:

[UNKNOWN]: No storage controller and disk drive data found in system
[UNKNOWN]: Request error: No array controller data returned for API URL '/redfish/v1/Systems/1//SmartStorage/ArrayControllers?$expand=.'
...

I have checked the verbose output for the same test server with the old versions (1.3.0 and 1.3.1) and the new version. Everything seems to be the same except the final return status and message, which contains this error in the new version and shows OK in the old version:

...
[OK]: Status of HP SmartArray and all components is: OK
...

Were there any changes related to this setting? Should I manually remove the "--storage" argument from each affected check? Or is there an option to skip it (similar to "--ignore_missing_ps")?

Thanks!

bb-Ricardo commented 1 year ago

Hi,

was there any status output other than OK for the storage with the old plugin if you use --detailed on the affected servers?

gabortakacs78 commented 1 year ago

Hi,

sorry for the late answer, I didn't get any notification about your reply :S

here is the end of the output (exit status) of the OLD version:

[OK]: Chassi 1 : All fans (1) are in good condition
[OK]: Chassi enclosurechassis : All fans (1) are in good condition
[OK]: All memory modules (Total 768GB) are in good condition
[OK]: All processors (2) are in good condition
[OK]: Status of HP SmartArray and all components is: OK
[OK]: INFO: HPE Synergy 480 Gen10 (CPU: 2, MEM: 768GB) - BIOS: I42 v2.60 (01/13/2022) - Serial: CZJxxxxxxx - Power: On - Name: xxxxxxx|'Fan_1.1'=21%;; 'Fan_enclosurechassis.1'=21%;;

And the new PLUGIN:

[UNKNOWN]: No storage controller and disk drive data found in system
[UNKNOWN]: Request error: No array controller data returned for API URL '/redfish/v1/Systems/1//SmartStorage/ArrayControllers?$expand=.'
[OK]: Chassi 1 : All fans (1) are in good condition
[OK]: Chassi enclosurechassis : All fans (1) are in good condition
[OK]: All memory modules (Total 768GB) are in good condition
[OK]: All processors (2) are in good condition|'Fan_1.1'=21%;; 'Fan_enclosurechassis.1'=21%;;

The beginning JSON part (detailed result) looks the same for both versions.

Br,

bb-Ricardo commented 1 year ago

Does this server have storage components?

gabortakacs78 commented 1 year ago

To tell the truth, I don't know... Can I check it from the detailed output (maybe from the old version)? I have already raised this question with our hardware colleagues (I am only responsible for monitoring), but I am still waiting for their answer.

bb-Ricardo commented 1 year ago

Yes, just use --storage --detailed --inventory, then you can see what is actually available. If there are no storage controllers or hard drives, then the server has no storage components.

If a server has no storage, you need to disable storage monitoring for it.
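That check could be scripted roughly as follows. This is only a sketch: it assumes the inventory output has been captured as a JSON string with an "inventory" object and a "meta.data_retrieval_issues" object, using the key names that appear in the inventory output quoted in this thread. The function name `summarize_storage` and the sample data are made up for illustration and are not part of check_redfish.

```python
import json

# Component types treated as "storage". The key names follow the inventory
# JSON quoted in this thread.
STORAGE_KEYS = (
    "storage_controller",
    "storage_enclosure",
    "physical_drive",
    "logical_drive",
)


def summarize_storage(inventory_text):
    """Return storage component counts and any storage retrieval issues."""
    data = json.loads(inventory_text)
    inventory = data.get("inventory", {})
    # Count the entries reported for each storage-related component type.
    counts = {key: len(inventory.get(key, [])) for key in STORAGE_KEYS}
    issues = data.get("meta", {}).get("data_retrieval_issues", {})
    return {
        "has_storage": any(counts.values()),
        "counts": counts,
        "issues": issues.get("storage_controller", []),
    }


if __name__ == "__main__":
    # Hypothetical sample: an inventory in which no storage data was reported.
    sample = json.dumps({
        "inventory": {key: [] for key in STORAGE_KEYS},
        "meta": {
            "data_retrieval_issues": {
                "storage_controller": ["No array controller data returned"]
            }
        },
    })
    print(summarize_storage(sample))
```

If `has_storage` comes back false and `issues` is not empty, it is still worth confirming with the hardware team whether the server really has no storage or whether the BMC is just not reporting it.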

gabortakacs78 commented 1 year ago

Hi,

it shows no DATA with both versions:

{
    "inventory": {
        "chassi": [],
        "fan": [],
        "firmware": [],
        "logical_drive": [],
        "manager": [],
        "memory": [],
        "network_adapter": [],
        "network_port": [],
        "physical_drive": [],
        "power_supply": [],
        "processor": [],
        "storage_controller": [],
        "storage_enclosure": [],
        "system": [],
        "temperature": []
    },
    "meta": {
        "data_retrieval_issues": {
            "storage_controller": [
                "No array controller data returned for API URL '/redfish/v1/Systems/1//SmartStorage/ArrayControllers?$expand=.'"
            ]
        },
        "duration_of_data_collection_in_seconds": 0.375447,
        "host_that_collected_inventory": "xxxxx",
        "inventory_id": null,
        "inventory_layout_version": "xxx",
        "script_version": "xxx",
        "start_of_data_collection": "2022-09-20T15:45:53+02:00"
    }
}

So it means there were some changes in check_redfish which changed the return status. As I thought, version 1.3.1 returned OK when array controllers were missing, but the newest version 1.4.1 returns UNKNOWN. In this case we need to separate servers with and without storage, and define different checks for them.
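Splitting the checks could look something like the sketch below. It assumes a per-host flag saying whether the server has storage; the host names, the `HOSTS` mapping and the `build_check_args` helper are hypothetical, and only the `--storage` and `--detailed` arguments are taken from this thread.

```python
def build_check_args(base_args, has_storage):
    """Return a copy of base_args, dropping --storage for hosts without storage."""
    if has_storage:
        return list(base_args)
    return [arg for arg in base_args if arg != "--storage"]


if __name__ == "__main__":
    # Hypothetical hosts; in practice this would come from the monitoring
    # configuration or an inventory database.
    HOSTS = {"server-with-storage": True, "server-without-storage": False}
    BASE_ARGS = ["--storage", "--detailed"]

    for host, has_storage in HOSTS.items():
        print(host, build_check_args(BASE_ARGS, has_storage))
```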

That was also my proposal to the server team on 6 September, but there is still no feedback from them. Maybe the summer holidays will finish soon...

Thanks for your help!

bb-Ricardo commented 1 year ago

Hi,

Well, the old behavior was not correct, as it reported OK for nonexistent components. Now it lets you know that no components to monitor were found.

This is important in case you have a server with storage components that are, for some reason, not reported. Then you would assume everything is OK even though nothing is monitored.

In the current implementation you will get an UNKNOWN if you try to monitor storage but no storage is reported.

It is important to know if a server has storage or power supply components in order to get correct monitoring results.

I've seen it quite a few times that an iLO reports components incorrectly, and then your monitoring is pretty much worthless, as you don't 'see' the real status of the components.
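To illustrate the difference between the two behaviors, here is a minimal sketch. It is not the plugin's actual code: the function name, the `treat_missing_as_ok` switch and the `health` field are assumptions made for this example. The numeric values are the standard monitoring plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard monitoring plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def storage_status(controllers, drives, treat_missing_as_ok=False):
    """Return (code, message) for the reported storage components.

    treat_missing_as_ok=True mimics the old result (OK with nothing found);
    the default mimics the current result (UNKNOWN with nothing found).
    """
    components = list(controllers) + list(drives)
    if not components:
        if treat_missing_as_ok:
            # Misleading: nothing was actually checked.
            return OK, "Status of storage and all components is: OK"
        return UNKNOWN, "No storage controller and disk drive data found in system"

    # 'health' is a hypothetical field name used only in this sketch.
    unhealthy = [c for c in components if c.get("health") != "OK"]
    if unhealthy:
        return CRITICAL, f"{len(unhealthy)} storage component(s) not OK"
    return OK, f"All storage components ({len(components)}) are in good condition"


if __name__ == "__main__":
    print(storage_status([], []))
    print(storage_status([], [], treat_missing_as_ok=True))
    print(storage_status([{"health": "OK"}], [{"health": "OK"}]))
```

With the default, a server whose BMC silently stops reporting its controllers shows up as UNKNOWN instead of staying green.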

gabortakacs78 commented 1 year ago

Hi,

OK, I will update the arguments once this is also agreed on the business side (HW guys). Thanks a lot for your support!

Br,

bb-Ricardo commented 1 year ago

No problem. You are welcome.