bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

No network adapter result in "Unable to connect to Host '0.0.0.0', max retries exhausted" #116

Closed lgmu closed 10 months ago

lgmu commented 1 year ago

Hi,

I found a small bug on an HPE ProLiant RL300 Gen11 (CPU: 1, MEM: 512GB) - BIOS: R11 v1.20 (04/14/2023)

https://github.com/bb-Ricardo/check_redfish/blob/next-release/cr_module/nic.py#L295

On one host "system_response" only contains "Links/EthernetInterfaces/@odata.id" but not "Links/NetworkAdapters/@odata.id"

That results in

ethernet_interfaces_path = /redfish/v1/Systems/1/EthernetInterfaces/
network_adapter_path = None

network_adapter_response now tries to fetch data: https://github.com/bb-Ricardo/check_redfish/blob/next-release/cr_module/nic.py#L303

INFO: Attempt 1 of None?$expand=.
2023-06-01 14:17:17,505 - DEBUG: Starting new HTTPS connection (1): 0.0.0.0none:443
2023-06-01 14:17:17,510 - INFO: Retrying None?$expand=. [HTTPSConnectionPool(host='0.0.0.0none', port=443): Max retries exceeded with url: /?$expand=. (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fecc10f0f70>: Failed to establish a new connection: [Errno -2] Name or service not known'))]
2023-06-01 14:17:18,512 - DEBUG: HTTP REQUEST (GET) for None?$expand=.:

[CRITICAL]: Unable to connect to Host '0.0.0.0', max retries exhausted.

The "None" gets appended to the IP address, which results in the "Name or service not known" error.
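The `0.0.0.0none` host in the log is consistent with a `None` path being formatted straight into the request URL. A minimal sketch, assuming the URL is built by plain string formatting (the actual check_redfish code may differ); `build_url` is a hypothetical name used only for illustration:

```python
from urllib.parse import urlsplit


def build_url(base_url, path):
    # Assumption: the request URL is assembled by simple string formatting,
    # so a path of None is rendered as the literal text "None".
    return f"{base_url}{path}?$expand=."


url = build_url("https://0.0.0.0", None)
parts = urlsplit(url)

print(url)             # https://0.0.0.0None?$expand=.
print(parts.hostname)  # 0.0.0.0none (host names are lower-cased)
print(parts.query)     # $expand=.
```

Because there is no `/` between the address and `None`, the URL parser treats `0.0.0.0None` as the host name, which cannot be resolved.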

When I manually set network_adapter_path = f"{redfish_url}/NetworkInterfaces"

then it exits correctly: [UNKNOWN]: Request error: No network adapter or interface data returned for API URL '/redfish/v1/Systems/1//NetworkInterfaces'
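One way to avoid the failing request would be to skip any path that resolved to `None` before issuing a GET. A minimal sketch, assuming the system response has the `Links/<name>/@odata.id` shape described above; `get_link` and `paths_to_query` are hypothetical helpers, not functions from check_redfish:

```python
def get_link(system_response, name):
    # Return the "@odata.id" of a named link, or None if the link is absent.
    links = system_response.get("Links") or {}
    entry = links.get(name) or {}
    return entry.get("@odata.id")


def paths_to_query(system_response):
    # Collect only the paths that are actually present, so that a missing
    # "NetworkAdapters" link never turns into a request for "None".
    candidates = (
        get_link(system_response, "NetworkAdapters"),
        get_link(system_response, "EthernetInterfaces"),
    )
    return [path for path in candidates if path]


# Example shaped like the affected host: only EthernetInterfaces is linked.
system_response = {
    "Links": {
        "EthernetInterfaces": {
            "@odata.id": "/redfish/v1/Systems/1/EthernetInterfaces/"
        }
    }
}

print(paths_to_query(system_response))
```

With this shape, only the ethernet interfaces path is returned, and an empty list signals that no network data is available at all.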

bb-Ricardo commented 1 year ago

Hi,

I guess this is a new server with iLO 6? I haven't had the chance to add support for this version because we don't have any G11 servers on site. Would it be OK to create a mockup of this machine and send it to me?

Then I could add iLO 6 support as well.

Thank you.

lgmu commented 1 year ago

Yes, it is an iLO 6 server, but apparently AMS (Agentless Management Service) isn't installed on that system yet. Maybe that's the reason why it doesn't work.

I need to check if it's okay to provide a mockup.

bb-Ricardo commented 1 year ago

Hi,

I was just wondering if there are any updates on this?

thank you

lgmu commented 1 year ago

Hi, they still haven't installed it, no idea why it takes sooooooo long to do this for a single host...

lgmu commented 11 months ago

Hi, finally it is installed, but it still does not work. I created a mockup and will clarify if I can share it with you!

lgmu commented 11 months ago

@bb-Ricardo I've sent you an email with an iLO 6 mockup, hope this helps!

bb-Ricardo commented 11 months ago

That is amazing, thank you very much. I will have a look at what's causing the issue.

bb-Ricardo commented 11 months ago

Hi,

I just pushed a change to the next-release branch. This was quite some work as HPE doesn't like to stick to one solution. It also improves the network interface output for iLO 5 systems.

Can you please test it out and let me know if it works for you?

thank you.

lgmu commented 11 months ago

Hi,

thanks for the change!

iLO 6: [OK]: All network adapter (1) and ports (2) are in good condition
iLO 5: [OK]: All network adapter (2) and ports (2) are in good condition

Unfortunately iLO 4 (Firmware: iLO 4 v2.82) broke: [CRITICAL]: Unable to connect to Host '0.0.0.0', max retries exhausted.

2023-12-22 09:15:49,271 - INFO: Attempt 3 of None?$expand=.
2023-12-22 09:15:49,273 - DEBUG: Starting new HTTPS connection (3): 0.0.0.0none:443
2023-12-22 09:15:49,275 - INFO: Retrying None?$expand=. [HTTPSConnectionPool(host='0.0.0.0none', port=443): Max retries exceeded with url: /?$expand=. (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd5a92619a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))]
2023-12-22 09:15:50,276 - DEBUG: HTTP REQUEST (GET) for None?$expand=.:

Now iLO 4 has the same problem that iLO 6 had before the change.

This worked fine with the old version: [OK]: All network adapter (1) and ports (6) are in good condition

bb-Ricardo commented 11 months ago

Ohh no, damn it. But thank you for testing.

I have a couple of iLO 4 mockups which I tested against, and none of them showed an issue.

Then would it be possible to provide a mockup for the server which causes this issue? That would be awesome.

I also pushed a change regarding storage discovery for iLO6. The NVMe drives should now only appear once in the output.

Thank you.

bb-Ricardo commented 11 months ago

hey @lgmu,

I just pushed another commit to next-release which might fix the issue. Can you please test this one again?

Thank you

lgmu commented 11 months ago

Yes, looks good now!

[OK]: BMC: iLO 4 (Firmware: iLO 4 v2.82) and all nics are in 'OK' state.
[OK]: All network adapter (1) and ports (6) are in good condition

Thanks for the fix. I don't have time today to test everything in detail, but I will give you more feedback after the Christmas holidays.

Enjoy your holidays as well!

bb-Ricardo commented 11 months ago

Great, thank you for testing.

Just test it once you have time and just close the issue if everything looks good.

Also enjoy the holidays and all the best for 2024.

lgmu commented 10 months ago

Happy new year!

We are using the new version in prod now and I haven't noticed any problems so far, thanks!

Sometimes we have checks flapping to UNKNOWN because of this: [UNKNOWN]: None : Request error: No 'chassis' property found in root path '/redfish/v1'

But this also happened in the old version and I believe it's a Dell Redfish API problem: it sometimes doesn't return the data. That is unrelated to this issue, so I will close it.

bb-Ricardo commented 10 months ago

Happy New Year and thank you for testing.