dell / iDRAC-Redfish-Scripting

Python and PowerShell scripting for Dell EMC PowerEdge iDRAC REST API with DMTF Redfish
GNU General Public License v2.0

iDRAC Version 7.00.00.171 Traceback Errors #291

Closed downtownle closed 6 months ago

downtownle commented 8 months ago

Hello Tex,

With iDRAC version 7.00.00.171 we are seeing more traceback errors; it seems as if the iDRAC no longer responds properly. Here are two examples of how Icinga reacted to this (attached image: picture1):

  File "/usr/lib64/nagios/plugins/check_redfish.py", line 178, in <module>
    plugin.do_exit()
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 427, in do_exit
    print(self.return_output_data())
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in return_output_data
    for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)):
  File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in <lambda>
    for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)):
ValueError: 'global' is not in list
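The `ValueError` in the last frame is what `list.index` raises when the value being looked up is missing from the list. A minimal sketch of that failure mode and of one tolerant alternative follows. The contents of `output_order` and the command names are illustrative stand-ins (the plugin's real list is not shown in this thread), and the fallback ranking is an assumption, not the fix adopted by the plugin.

```python
# Hypothetical ordering list; the plugin's real list is not shown in this thread.
output_order = ["power", "fan", "mel"]


def sort_strict(commands):
    # Same pattern as the traceback: list.index raises ValueError
    # for any command that is absent from output_order.
    return sorted(commands, key=lambda x: output_order.index(x))


def sort_tolerant(commands):
    # Unknown commands get a rank past the end of the list, so they
    # sort last instead of aborting the whole output.
    rank = {name: i for i, name in enumerate(output_order)}
    return sorted(commands, key=lambda x: rank.get(x, len(output_order)))


if __name__ == "__main__":
    commands = ["mel", "global", "fan"]
    try:
        sort_strict(commands)
    except ValueError as exc:
        print("strict sort failed:", exc)
    print(sort_tolerant(commands))
```

The traceback therefore looks like a secondary symptom: the plugin fails while formatting its output because a 'global' entry appears that the ordering list does not expect.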

And from another server:

{
  "error": {
    "@Message.ExtendedInfo": [
      {
        "Message": "The requested operation cannot be completed because of an internal error.",
        "MessageArgs": [],
        "MessageArgs@odata.count": 0,
        "MessageId": "IDRAC.2.8.SYS446",
        "RelatedProperties": [],
        "RelatedProperties@odata.count": 0,
        "Resolution": "Retry the operation after a few minutes. If the issue persists, contact your service provider.",
        "Severity": "Critical"
      },
      {
        "Message": "The request failed due to an internal service error. The service is still operational.",
        "MessageArgs": [],
        "MessageArgs@odata.count": 0,
        "MessageId": "Base.1.12.InternalError",
        "RelatedProperties": [],
        "RelatedProperties@odata.count": 0,
        "Resolution": "Resubmit the request. If the problem persists, consider resetting the service.",
        "Severity": "Critical"
      }
    ],
    "code": "Base.1.12.GeneralError",
    "message": "A general error has occurred. See ExtendedInfo for more information"
  }
}
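When collecting these failures from several servers, it can help to reduce each error body to its message IDs. The sketch below assumes only the field names visible in the response above; the function name `summarize_redfish_error` is hypothetical and not part of any library or of the plugin.

```python
import json


def summarize_redfish_error(body):
    """Return (code, [(MessageId, Severity, Message), ...]) from a Redfish error body."""
    error = json.loads(body).get("error", {})
    details = [
        (item.get("MessageId"), item.get("Severity"), item.get("Message"))
        for item in error.get("@Message.ExtendedInfo", [])
    ]
    return error.get("code"), details


if __name__ == "__main__":
    # Trimmed copy of the response quoted above.
    sample = json.dumps({
        "error": {
            "@Message.ExtendedInfo": [
                {"MessageId": "IDRAC.2.8.SYS446", "Severity": "Critical",
                 "Message": "The requested operation cannot be completed because of an internal error."},
                {"MessageId": "Base.1.12.InternalError", "Severity": "Critical",
                 "Message": "The request failed due to an internal service error. The service is still operational."},
            ],
            "code": "Base.1.12.GeneralError",
        }
    })
    code, details = summarize_redfish_error(sample)
    print(code)
    for message_id, severity, message in details:
        print(f"{severity}: {message_id}: {message}")
```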

I don't think it's a login problem (we don't get a 400 Bad Request). It looks more like a problem with keeping sessions or cleanly ending the Redfish call: when I run closessn -a, i.e. manually end all logins, it works again. To me this suggests that the iDRAC keeps the session ID but does not pass the request data to that session ("The request failed due to an internal service error. The service is still operational."). Other checks with different session IDs continue to run smoothly. It may also have something to do with the size of the response ("The requested operation cannot be completed because of an internal error."), because the affected checks are usually ones that return performance data or log entries (fan rotation, memory utilization, MEL/SEL).

texroemer commented 8 months ago

Hi @downtownle

Can you share the previous iDRAC version you were using before updating to 7.00.00.171 and did this version also have traceback errors?

For the internal error response do you know what URI(s) were being called for GET requests?

Thanks Tex

downtownle commented 8 months ago

Hello Tex,

the previous version was 6.10.80.00, A00

/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$skip=1350 (for logs, for example, it fetches the first 50 entries, then the next 50, etc.)
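The paging pattern described here (50 entries per request, advancing with `$skip`) can be sketched as below. The page size of 50 comes from the comment above; the helper name `lclog_page_uris` and the choice to send the first request without `$skip` are assumptions, and the plugin's actual paging logic may differ.

```python
LCLOG_ENTRIES = "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries"


def lclog_page_uris(total_entries, page_size=50):
    """Yield one URI per page, stepping through the log with $skip."""
    for skip in range(0, total_entries, page_size):
        # Assumption: the first page is requested without a $skip parameter.
        yield LCLOG_ENTRIES if skip == 0 else f"{LCLOG_ENTRIES}?$skip={skip}"


if __name__ == "__main__":
    # With 1400 entries, the last request is the $skip=1350 URI quoted above.
    for uri in lclog_page_uris(1400):
        print(uri)
```

Under these assumptions a log with 1400 entries needs 28 GET requests within a single check, so larger logs mean proportionally more requests per run.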

Otherwise, the plugin uses auto-discovery to find the relevant paths (e.g. for fans) and then requests the corresponding data.

We use the Icinga plugin CheckRedfish (https://github.com/bb-Ricardo/check_redfish).

texroemer commented 8 months ago

Thanks for the details, can you confirm with 6.10.80 you also see traceback errors or you only see the issue with 7.00.00.171?

Thanks Tex

downtownle commented 8 months ago

Hello Tex,

I can't rule out that it happened before, but if it did, it was very rare and went unnoticed. With the new version it affects 1 to 2 servers per day.

downtownle commented 8 months ago

Hello Tex,

Four more today, all log checks. It really does seem to be related to the size of the data being received. Do you have a chance to check this in your lab? (attached image: export1)

texroemer commented 8 months ago

Hi @downtownle

Last night I looped check_redfish.py (1000 loops) with the --all argument and was unable to reproduce the issue.

On the server which reproduces the issue, can you send me the value for "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count"?

Thanks Tex

downtownle commented 8 months ago

Hello Tex,

I added the command and parameters below. It is a 14G system (R740xd, iDRAC9) with firmware version 7.00.00.171.

Command which is executed: '/usr/lib64/nagios/plugins/check_redfish.py' '--host' '192.168.0.120' '--mel' '--password' 'geblurrt' '--retries' '5' '--timeout' '120' '--username' 'icinga'

__name "check_redfish_mel"
active true
arguments {
    --authfile: { description: "Autentication file content: username= password=", value: "$redfish_authfile$" },
    --critical: { description: "Critical threshold for certain checks. See documentation", value: "$redfish_critical$" },
    --detailed: { description: "always print detailed result instead of a condensed one line result", set_if: "$redfish_detailed$" },
    --host: { description: "hostname or address of the interface to query", required: true, value: "$host.vars.interfaces_ilo$" },
    --max: { description: "maximum of returned event log entries", value: "$redfish_max$" },
    --mel: { required: true },
    --password: { description: "The login password", value: "$redfish_password$" },
    --retries: { description: "set number of maximum retries", value: "$redfish_retries$" },
    --sessionfile: { description: "Name of the session file. make sure it is unique for every host", value: "$redfish_sessionfile$" },
    --sessionfiledir: { description: "Directory where the session files should be stored", value: "$redfish_sessionfiledir$" },
    --timeout: { description: "set number of request timeout per try/retry", value: "$redfish_timeout$" },
    --username: { description: "The login user name", value: "$redfish_username$" },
    --warning: { description: "Warning threshold for certain checks. See documentation", value: "$redfish_warning$" }
}
command [ "/usr/lib64/nagios/plugins/check_redfish.py" ]
env null
execute { arguments: [ "checkable", "cr", "resolvedMacros", "useResolvedMacros" ], deprecated: false, name: "Internal#PluginCheck", side_effect_free: false, type: "Function" }
ha_mode 0
name "check_redfish_mel"
original_attributes null
package "director"
paused false
timeout 60
type "CheckCommand"
vars { check_address: { arguments: [], deprecated: false, name: "", side_effect_free: false, type: "Function" }, check_ipv4: false, check_ipv6: false, redfish_bmc: true }
version 0
zone "director-global"

texroemer commented 8 months ago

Thanks but can you share the members count for the LC logs on your server?

downtownle commented 8 months ago

Hello Tex,

what exactly do you mean by the members count for the LC logs on your server?

texroemer commented 8 months ago

Can you run GET on URI "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count" to get this count value?

Example:

[root@localhost ~]# curl -k -X GET -u root:calvin -H "Content-Type: application/json" 'https://192.168.0.120/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count' --insecure | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240  100   240    0     0    583      0 --:--:-- --:--:-- --:--:--   583
{
  "@odata.context": "/redfish/v1/$metadata#LogEntryCollection.LogEntryCollection",
  "@odata.id": "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Members@odata.count": 5389
}
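For anyone who prefers Python to curl, the same GET can be sketched with the standard library alone. The function names `fetch_member_count` and `parse_member_count` are hypothetical; certificate verification is switched off only to mirror `curl -k`, which is an assumption about a lab setup and not a recommendation.

```python
import base64
import json
import ssl
import urllib.request

COUNT_PATH = (
    "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries"
    "?$select=Members@odata.count"
)


def parse_member_count(body):
    """Extract the LC log entry count from the JSON response body."""
    return json.loads(body)["Members@odata.count"]


def fetch_member_count(host, username, password, timeout=30):
    """GET the entry count from an iDRAC using HTTP basic authentication."""
    # Mirrors curl -k: skip certificate checks (lab use only).
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{host}{COUNT_PATH}",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return parse_member_count(response.read())
```

Calling `fetch_member_count("192.168.0.120", "root", "calvin")` would issue the same request as the curl example and return the integer count.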

downtownle commented 8 months ago

Hello Tex,

I added the output (attached images: p2, p3, p1).

texroemer commented 8 months ago

Thanks for the details. On a server which reproduces the issue, can you loop the check_redfish.py script from a terminal using a simple bash script and see whether you hit the issue? I would like to know whether this workflow can trigger it (it is the workflow I used when trying to reproduce, on a server with over 5000 LC log entries, and I was unable to hit the issue).

Also, can you let me know whether only one Redfish session to this iDRAC is pulling data, or whether you are running multiple Redfish sessions against this iDRAC at the same time?

Thanks Tex

downtownle commented 8 months ago

Hello Tex,

Do you have an example of what you mean by this? "can you just loop check_redfish.py script from a terminal using a simple bash script and see if you can hit this issue"

texroemer commented 8 months ago

Sure. The example below is a bash loop script I created which calls the Python script. I append the output to a file and then grep the file for any warning or critical errors.

root@localhost:/opt/check_redfish# cat loop.sh
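#!/bin/bash
# Usage: ./loop.sh <idrac_ip> <username> <password> <plugin_args> <loop_count>

# Initialize counter and read the positional arguments
counter=1
idrac_ip=$1
idrac_username=$2
idrac_password=$3
arg_name=$4
loop_count=$5

# Create loop.txt if needed and reset it before the run
touch loop.txt
echo > loop.txt

# Run the plugin loop_count times
while [ "$counter" -le "$loop_count" ]
do
    # $arg_name is left unquoted so that several plugin flags can be passed as one argument
    python3 check_redfish.py -H "$idrac_ip" -u "$idrac_username" -p "$idrac_password" $arg_name
    echo "- Current loop Count: $counter"
    ((counter++))
done

echo "Loop script finished"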

root@localhost:/opt/check_redfish# ./loop.sh 192.168.0.120 root calvin --all 2 >> loop.txt
root@localhost:/opt/check_redfish# cat loop.txt | grep -i warning
root@localhost:/opt/check_redfish# cat loop.txt | grep -i critical
root@localhost:/opt/check_redfish#
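
The two grep commands above can also be expressed as a small Python helper when the results need to be counted, not just eyeballed. This is a sketch only: the name `count_states` is hypothetical, and it simply counts lines containing "warning" or "critical" case-insensitively, as `grep -i` would match them.

```python
def count_states(lines):
    """Count lines mentioning 'warning' or 'critical', ignoring case."""
    counts = {"warning": 0, "critical": 0}
    for line in lines:
        lowered = line.lower()
        for state in counts:
            if state in lowered:
                counts[state] += 1
    return counts
```

For example, `with open("loop.txt") as handle: print(count_states(handle))` reports both counts for the file written by the loop script; zero for both corresponds to the empty grep output shown above.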