bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

NVMe not recognized #129

Closed — Decstasy closed this issue 1 month ago

Decstasy commented 1 month ago

Hello,

I have a case where a DL360 G10 is not showing its NVMe drives alongside the Smart Array P816i-a SR Gen10. The NVMes are housed in an HPE 10SFF NVMe/SAS 10/8 Bkpln. If I run the check with the --all argument, it recognizes the drives, but not if I use it with the --storage argument.

Might be related to https://github.com/bb-Ricardo/check_redfish/issues/113 and https://github.com/bb-Ricardo/check_redfish/issues/128

It seems that the storage module is not descending into the chassis/backplane structure to get the drives; unfortunately, I was not able to find the exact entry point inside the storage module to change it myself :( I have used the newest check version and also tried the next_release branch.
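For context on where those drives live in the Redfish tree: on systems like this, NVMe drives behind a backplane are typically linked from the Chassis resources (`Chassis/{id}/Links/Drives`) rather than appearing under `Systems/{id}/Storage`, which would explain why `--storage` misses them while `--all` finds them. A minimal sketch of reading those links — the function name and payload shape are illustrative, not taken from check_redfish:

```python
def drive_uris_from_chassis(chassis_resource: dict) -> list:
    """Collect the @odata.id of every drive linked from a chassis resource.

    On some HPE iLO 5 systems, NVMe drives behind a backplane are only
    reachable via Chassis/{id}/Links/Drives, not Systems/{id}/Storage.
    Both "Links" and "Drives" may be absent, so fall back to empty
    containers instead of iterating over None.
    """
    links = chassis_resource.get("Links") or {}
    return [drive["@odata.id"] for drive in links.get("Drives") or []]


# Example payload shaped like a Redfish chassis resource (drive IDs invented):
chassis = {
    "Links": {
        "Drives": [
            {"@odata.id": "/redfish/v1/Chassis/1/Drives/DA000001"},
            {"@odata.id": "/redfish/v1/Chassis/1/Drives/DA000002"},
        ]
    }
}
print(drive_uris_from_chassis(chassis))
```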


check_redfish on  main [$?] via py3venv …
➜ ./check_redfish.py '--retries' '3' '--timeout' '13' -u administrator -p REDACTED --host REDACTED --storage -d
[OK]: All HP SmartArray controller (1), logical drives (1), physical drives (2), enclosures (2) and batteries (1) are in good condition.
[OK]: HPE Smart Array P816i-a SR Gen10 (FW: 6.52) status is: OK
[OK]: Physical Drive (2I:1:5) 240GB status: OK
[OK]: Physical Drive (2I:1:6) 240GB status: OK
[OK]: Logical Drive (0:1) 240.0GB (RAID 1) status: OK
[OK]: StorageEnclosure (1I:0) status: OK
[OK]: StorageEnclosure (2I:1) status: OK
[OK]: SmartStorageBattery 1 (charge level: 100%, capacity: 96W) status: OK

check_redfish on  main [$?] via py3venv took 2,9s …
➜ ./check_redfish.py '--retries' '3' '--timeout' '13' -u administrator -p REDACTED --host REDACTED --all -d | grep -i nvme
[OK]: HPE 10SFF NVMe/SAS 10/8 Bkpln (Box=1): 1.24
[OK]: HPE 10SFF NVMe/SAS 10/8 Bkpln (Box=1): 1.24
[OK]: HPE 10SFF NVMe/SAS 10/8 Bkpln (Box=1): 1.24
[OK]: HPE 10SFF NVMe/SAS 10/8 Bkpln (Box=1): 1.24
[OK]: HPE 10SFF NVMe/SAS 10/8 Bkpln (Box=1): 1.24
[OK]: NVMe Backplane Firmware (System Board): 1.24
[OK]: NVMe Drive (NVMe Drive Port 1B Box 1 Bay 9): HPK5
[OK]: NVMe Drive (NVMe Drive Port 2B Box 1 Bay 7): HPK5
[OK]: NVMe Drive (NVMe Drive Port 4B Box 1 Bay 3): HPK5
[OK]: NVMe Drive (NVMe Drive Port 5B Box 1 Bay 1): HPK5

Is it possible to send you the raw API responses privately (via the email found in the code)?

Thank you for this great project and the support :)

P.S. I'm at a conference for the next 3 days, so I can respond again on Friday.

bb-Ricardo commented 1 month ago

Hi @Decstasy,

I just pushed a new commit to the next-release branch. The NVMe drives should now show up, but some data is reported twice (like the storage controller and logical volume), and for each NVMe both a Physical Drive and a Storage Controller are reported.
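Since the same drive can now be reached via two paths (the system's storage collection and the chassis drive links), the duplicates could in principle be filtered plugin-side by keying each resource on its `@odata.id`. A hedged sketch — the helper name and key choice are assumptions, not check_redfish internals:

```python
def dedupe_by_odata_id(resources: list) -> list:
    """Drop Redfish resources already seen under another path.

    @odata.id uniquely identifies a resource regardless of how it was
    reached, so it makes a natural deduplication key. Resources without
    one (rare, but possible) are kept as-is via an object-identity key.
    """
    seen = set()
    unique = []
    for resource in resources:
        key = resource.get("@odata.id") or id(resource)
        if key in seen:
            continue
        seen.add(key)
        unique.append(resource)
    return unique
```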

Decstasy commented 1 month ago

Hey @bb-Ricardo, thank you very much for this fast bugfix! The server is now correctly monitored:

check_redfish on  next-release [$?] via py3venv …
➜ ./check_redfish.py '--retries' '3' '--timeout' '13' -u administrator -p REDACTED --host REDACTED --storage -d   
[OK]: All storage controllers (6), logical drives (2), physical drives (6), enclosures (2) and batteries (1) are in good condition.
[OK]: HPE Smart Array P816i-a SR Gen10 (FW: 6.52) status is: OK
[OK]: Physical Drive (2I:1:5) 240GB status: OK
[OK]: Physical Drive (2I:1:6) 240GB status: OK
[OK]: Logical Drive (0:1) 240.0GB (RAID 1) status: OK
[OK]: StorageEnclosure (1I:0) status: OK
[OK]: StorageEnclosure (2I:1) status: OK
[OK]: NVMe Storage Controller MO006400KYDND Bay 9 (FW: HPK5) status is: OK
[OK]: Physical Drive Secondary Storage Device 1:9 (MO006400KYDND / SSD / NVMe) 6401.25GiB status: OK
[OK]: NVMe Storage Controller MO006400KYDND Bay 7 (FW: HPK5) status is: OK
[OK]: Physical Drive Secondary Storage Device 1:7 (MO006400KYDND / SSD / NVMe) 6401.25GiB status: OK
[OK]: NVMe Storage Controller MO006400KYDND Bay 3 (FW: HPK5) status is: OK
[OK]: Physical Drive Secondary Storage Device 1:3 (MO006400KYDND / SSD / NVMe) 6401.25GiB status: OK
[OK]: NVMe Storage Controller MO006400KYDND Bay 1 (FW: HPK5) status is: OK
[OK]: Physical Drive Secondary Storage Device 1:1 (MO006400KYDND / SSD / NVMe) 6401.25GiB status: OK
[OK]: Controller HPE Smart Array P816i-a SR Gen10 status is: OK
[OK]: Logical Drive Logical Drive 1 (Logical Drive 1) 240GiB (RAID1) status: OK
[OK]: SmartStorageBattery 1 (charge level: 99%, capacity: 96W) status: OK

I will push this into our production next week. Please leave the issue open; I will report any side effects/bugs on other systems, or a complete success :) Have a nice weekend!

Dennis

bb-Ricardo commented 1 month ago

Great, this is what it should look like. It's unfortunate that some components are reported twice, but this only applies to HPE servers with iLO 5 and recent firmware versions.

Decstasy commented 1 month ago

Hello Ricardo,

I pushed it into our production, everything looks good except for one machine.

Traceback (most recent call last):
  File "./check_redfish.py", line 171, in <module>
    if any(x in args.requested_query for x in ['storage', 'all']):  get_storage()
  File "/usr/lib64/nagios/plugins/check_redfish/cr_module/storage.py", line 35, in get_storage
    get_storage_generic(system)
  File "/usr/lib64/nagios/plugins/check_redfish/cr_module/storage.py", line 869, in get_storage_generic
    for storage_member in storage_response.get("Members"):
TypeError: 'NoneType' object is not iterable
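The traceback boils down to iterating over `storage_response.get("Members")` when the response carries no `Members` key, so the loop receives `None`. A minimal guard for that case — the helper name is hypothetical, not the actual fix in check_redfish:

```python
def iter_members(collection_resource: dict) -> list:
    """Safely iterate a Redfish collection resource.

    Some (especially older) iLO firmware can return a storage collection
    without a "Members" key; .get("Members") then yields None, and
    "for x in None" raises exactly the TypeError shown in the traceback.
    Falling back to an empty list keeps the iteration safe.
    """
    return collection_resource.get("Members") or []


# With the guard, a member-less response simply yields no iterations:
for storage_member in iter_members({}):
    print(storage_member)  # never reached for an empty/None collection
```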

The machine is using an HPE Smart Array P408i-a SR Gen10 controller with two storage enclosures. Do you already have an idea of what's going wrong? If you need debug output or a mockup, I would be happy to send it privately. Thank you very much!

Dennis

bb-Ricardo commented 1 month ago

Hi @Decstasy, I just pushed another commit to next-release which should fix the issue.

Is it possible that this server has an older iLO firmware than the other servers?

Decstasy commented 1 month ago

Yup, that is correct, there is an older FW installed. I would say the issue is fixed. Thank you!

bb-Ricardo commented 1 month ago

Great, thank you for the feedback. I'm just waiting for a response on the other issue and will then release a new version.