bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

Issues with sessions not reused. #89

Closed dan-m-joh closed 2 years ago

dan-m-joh commented 2 years ago

Hello Ricardo,

Today I tried to update check_redfish from 1.3.2 to 1.4.0 but I failed miserably :-(

As soon as I changed to 1.4.0 I started to get a lot of "Username or password invalid." messages. Rolling back to 1.3.2 "solves" the issue and everything works as expected again. Have you changed anything in the "user / session handling" part in the 1.4.0 release?

The really strange thing is that about 1 out of 10 times it works. For example, the following two commands were run about 5 seconds apart:

$ ./check_redfish.py -H <hostname> -u <user> -p <password> --sessionfiledir /mnt/data01/redfish_sessions --bmc --detailed
[CRITICAL]: Username or password invalid.

$ ./check_redfish.py -H <hostname> -u <user> -p <password> --sessionfiledir /mnt/data01/redfish_sessions --bmc --detailed
[OK]: BMC: iBMC (Firmware: 5.13) and all nics are in 'OK' state.
[OK]: NIC 1:<mac> '<hostname>' (IPs: <IPv4>/<IPv6>) (speed: None, autoneg: None, duplex: None) status: OK
[OK]: BMC License: NotInstalled (NotInstalled)

This happens on all of our hardware types (Dell, HPE, Huawei).

Do you have any idea or suggestion?

Regards, Dan

bb-Ricardo commented 2 years ago

Hi, I also experienced this issue. Currently investigating.

Sorry for the inconvenience.

ajoergensen commented 2 years ago

I don't know if it's relevant, but I'm only experiencing this on Cisco (CIMC) servers, and on all of them. HPE and Dell are fine (for now at least).

2022-05-12 14:39:48,429 - INFO: Login returned code 400: {
    "error":    {
        "code": "Base.1.4.0.GeneralError",
        "message":  "See ExtendedInfo for more information.",
        "@Message.ExtendedInfo":    [{
                "@odata.type":  "Message.v1_0_6.Message",
                "MessageId":    "Base.1.4.0.SessionLimitExceeded",
                "Message":  "The session establishment failed due to the number of simultaneous sessions exceeding the limit of the implementation.",
                "MessageArgs":  [],
                "Severity": "Warning",
                "Resolution":   "Reduce the number of other sessions before trying to establish the session or increase the limit of simultaneous sessions (if supported)."
            }]
    }
}
[CRITICAL]: Username or password invalid.

I checked the CIMC; it says 0/4 Redfish sessions are in use.
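The log above shows the BMC answering with `SessionLimitExceeded` while the plugin prints "Username or password invalid." As an illustration only, here is a minimal sketch of how a Redfish error body could be inspected to tell a session limit apart from a real authentication failure. The function name `classify_login_error` and the marker list are assumptions made for this example; this is not how check_redfish itself handles the response.

```python
import json

# Hypothetical markers, taken from the MessageId values quoted in this thread.
SESSION_LIMIT_MARKERS = ("SessionLimitExceeded", "RAC0218")


def classify_login_error(status_code, body):
    """Return 'session_limit', 'auth' or 'unknown' for a failed login response."""
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        payload = {}

    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        error = {}

    # Collect every MessageId found in the extended info list.
    extended = error.get("@Message.ExtendedInfo") or []
    message_ids = [
        str(entry.get("MessageId", ""))
        for entry in extended
        if isinstance(entry, dict)
    ]

    if any(marker in mid for mid in message_ids for marker in SESSION_LIMIT_MARKERS):
        return "session_limit"
    if status_code in (401, 403):
        return "auth"
    return "unknown"
```

With the Cisco response above (code 400, `Base.1.4.0.SessionLimitExceeded`) this sketch would report `session_limit` rather than an authentication problem.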

bb-Ricardo commented 2 years ago

Thank you for testing. Something strange is going on and I have to find the issue. I hope I can push 1.4.1 with a fix tomorrow.

dan-m-joh commented 2 years ago

No Problem, Take your time :-) Rolling back to 1.3.2 "solves" the issue.

Dan

bb-Ricardo commented 2 years ago

Hi,

I assume I found the issue with the last release. I pushed a change to "next-release". Can you please test it and see if this works better?

Thank you.

dan-m-joh commented 2 years ago

Good morning Ricardo,

Sad to say, no. :-( The issue still exists in "next-release", with the same error message: "Username or password invalid."

Could this have something to do with the fact that we are still using the redfish-2.1.8 python library?

Regards, Dan

bb-Ricardo commented 2 years ago

Hi Dan,

Thank you for testing it again.

It is strange that it doesn't work for you. I tested the change in our environment and the issue didn't occur again. We are also still on "redfish 2.1.4".

The existing sessions will be used as before and the plugin does not seem to have any issues with logging in. 🤔

bb-Ricardo commented 2 years ago

Just released Version 1.4.1 > https://github.com/bb-Ricardo/check_redfish/releases/tag/v1.4.1

dan-m-joh commented 2 years ago

Hello Ricardo,

Sorry, could you please reopen this issue? The error still persists :-(

I think (not 100% sure) I have found a common factor for our problem. If I call check_redfish like this, it works as it should:

check_redfish.py -H <hostname> -u <user> -p <password> --sessionfiledir /mnt/data01/redfish_sessions --bmc --detailed

If, on the other hand, I call it like this (with FQDN), it fails with "Username or password invalid.":

check_redfish.py -H <hostname>.<domain> -u <user> -p <password> --sessionfiledir /mnt/data01/redfish_sessions --bmc --detailed

As I said, I am not yet 100% sure - I have to do some more testing (I have to do some careful testing as I can only test this in production). Will get back to you as soon as I can.

Regards, Dan

bb-Ricardo commented 2 years ago

Hi Dan,

Interesting finding. Can you run the plugin with the -v option and post the output here (make sure to remove username and password)?

  1. run without an existing session file
  2. run with existing session file

Thank you

dan-m-joh commented 2 years ago

This is really "interesting"... It looks like the FQDN is not the issue... The thing is, when executed using the CLI it works, as soon as I execute it from Nagios it starts erroring out.

I'll try running it without a sessionfile (--nosession).

Dan

dan-m-joh commented 2 years ago

Ahhhh, now I think I know what is going on...

It looks like we are hitting some "user-limit" on the Redfish interface of our Huawei servers (no matter if we use a session file or not). If I select all of the Redfish checks for a host (10 - 12 checks) and do a "Reschedule Check Now", all but about four checks immediately go "Red" with the "Username or password invalid." error message. If I do "Reschedule Check Now" staggered, the checks come back to OK status.

I'll see if I can script something to "provoke" the error and check whether anything shows up in the "-v" output.
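The observation that staggered checks succeed while simultaneous ones fail suggests spreading login attempts out in time. A minimal sketch of that idea, assuming a caller-supplied `attempt_login` callable (hypothetical, not a check_redfish function) that returns a truthy value on success: retry with exponential backoff plus random jitter so that parallel checks do not all hit the BMC at the same moment.

```python
import random
import time


def call_with_jitter(attempt_login, retries=5, base_delay=1.0,
                     sleep=time.sleep, rng=random.random):
    """Call attempt_login() until it succeeds or the retries are used up.

    attempt_login is a hypothetical callable supplied by the caller.
    sleep and rng are injectable so the behaviour can be tested.
    """
    for attempt in range(retries):
        result = attempt_login()
        if result:
            return result
        if attempt < retries - 1:
            # Exponential backoff scaled by a random factor in [0.5, 1.5)
            # so that concurrent checks drift apart instead of retrying together.
            sleep(base_delay * (2 ** attempt) * (0.5 + rng()))
    return None
```

Whether this helps depends on how long the BMC keeps the competing sessions open; it only reduces collisions and does not raise the session limit.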

Dan

log1-c commented 2 years ago

I noticed this after upgrading today as well:

The debug output shows:

Body Response of /redfish/v1/SessionService/Sessions: b'{"error":{"@Message.ExtendedInfo":[{"Message":"The maximum number of user sessions is reached.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.2.4.RAC0218","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"Make sure that the number of active sessions is less than or equal to the threshold limit. To end a current session and start a new session, do one of the following: 1) Log out of iDRAC by using a Graphical User Interface (GUI). 2) End any unused session by running the following RACADM command at the Command Line Interface (CLI): racadm closessn -i [session id].Warning: Ending a session may abruptly stop any running operation on that specific iDRAC session.","Severity":"Informational"}],"code":"Base.1.7.GeneralError","message":"A general error has occurred. See ExtendedInfo for more information"}}'
2022-05-18 14:07:48,103 - INFO: Login returned code 503: {"error":{"@Message.ExtendedInfo":[{"Message":"The maximum number of user sessions is reached.","MessageArgs":[],"MessageArgs@odata.count":0,"MessageId":"IDRAC.2.4.RAC0218","RelatedProperties":[],"RelatedProperties@odata.count":0,"Resolution":"Make sure that the number of active sessions is less than or equal to the threshold limit. To end a current session and start a new session, do one of the following: 1) Log out of iDRAC by using a Graphical User Interface (GUI). 2) End any unused session by running the following RACADM command at the Command Line Interface (CLI): racadm closessn -i [session id].Warning: Ending a session may abruptly stop any running operation on that specific iDRAC session.","Severity":"Informational"}],"code":"Base.1.7.GeneralError","message":"A general error has occurred. See ExtendedInfo for more information"}}
[CRITICAL]: Username or password invalid.

The example command is:

./check_redfish.py '--power' '--authfile' '/etc/icinga2/redfish.cfg' '--host' 'HOST.DOMAIN' '--retries' '5' '--timeout' '10'

Interestingly, this only happens for two Dell XC640 systems. Other systems (different type/vendor and other XC640) are working fine.

I will check with a colleague what changed for those two systems.

edit: we just had a look and he couldn't even log in to the web UI with the root/admin user. I will now try to restart the iDRAC via the CLI.

edit#2: well, unsurprisingly the iDRAC restart made the checks work again.

bb-Ricardo commented 2 years ago

It also depends on the number of workers. Every monitoring worker that does not share the session file will use a separate session.

If a session file is present, then this session file should be reused to work with the same session ID.
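To illustrate the idea of several workers sharing one session, here is a rough sketch of reading or creating a session file under an exclusive lock. It is an assumption-laden example, not the plugin's actual session handling: `get_or_create_session` and `create_session` are hypothetical names, the file content is assumed to be a small JSON dict, and `fcntl` limits it to Unix-like systems.

```python
import fcntl
import json
import os


def get_or_create_session(path, create_session):
    """Return the session stored at path, creating it once if it is missing.

    create_session is a hypothetical callable returning a JSON-serialisable
    dict, for example {"token": "...", "location": "..."}.
    """
    # Open read/write, creating the file if needed, without truncating it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    with os.fdopen(fd, "r+") as handle:
        # Exclusive lock: only one worker at a time may read or create.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            raw = handle.read()
            if raw.strip():
                try:
                    return json.loads(raw)
                except ValueError:
                    pass  # unreadable file: fall through and create a new session
            session = create_session()
            handle.seek(0)
            handle.truncate()
            json.dump(session, handle)
            handle.flush()
            return session
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

Workers pointing at the same path would then log in once and reuse the stored data. A real implementation would also need to detect an expired session and replace it.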

dan-m-joh commented 2 years ago

I have now "activated" the 1.4.1 version and added "--retries 5" to my command line. So far it looks OK - no idea why it did not work the last time I tried the 1.4.1 version.

Now we just have to wait and see...