Closed: downtownle closed this thread 6 months ago
Hi @downtownle
Can you share the previous iDRAC version you were using before updating to 7.00.00.171, and did that version also show traceback errors?
For the internal error response, do you know which URI(s) were being called with GET requests?
Thanks Tex
Hello Tex,
The previous version was 6.10.80.00, A00.
/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$skip=1350 (for the logs, for example, it fetches the first 50 entries, then the next 50, and so on; see the sketch below).
Otherwise, the plugin uses auto-discovery to find the specific paths (e.g. for fans) and then requests the corresponding data.
We use the Icinga plugin check_redfish (https://github.com/bb-Ricardo/check_redfish).
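For context, the paging against the log entry collection works roughly like the sketch below. This is a simplified standalone illustration, not the plugin's actual code; the host, credentials, page size and the explicit $top parameter are assumptions on my part:

# Simplified sketch of the $skip-based paging described above (not the plugin's
# actual code). Host, credentials and page size are placeholders.
import requests
import urllib3

urllib3.disable_warnings()  # the iDRAC uses a self-signed certificate here

IDRAC = "https://192.168.0.120"
URI = "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries"
AUTH = ("icinga", "password")
PAGE_SIZE = 50

entries = []
skip = 0
while True:
    # Fetch the next page of log entries, e.g. ?$skip=1350 for entry 1351 onwards
    response = requests.get(
        f"{IDRAC}{URI}?$skip={skip}&$top={PAGE_SIZE}",
        auth=AUTH,
        verify=False,
        timeout=120,
    )
    # an internal error response from the iDRAC would surface here as an HTTPError
    response.raise_for_status()
    data = response.json()
    members = data.get("Members", [])
    entries.extend(members)
    if not members or len(entries) >= data.get("Members@odata.count", 0):
        break
    skip += PAGE_SIZE

print(f"Fetched {len(entries)} LC log entries")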
Thanks for the details. Can you confirm whether you also saw traceback errors with 6.10.80, or do you only see the issue with 7.00.00.171?
Thanks Tex
Hello Tex,
I can't rule out that it happened before, but if it did, it was very rare and we never noticed it. With the new version we see one to two affected servers per day.
Hello Tex,
Four more today, all log checks. It really does seem to be related to the size of the data being returned. Do you have a chance to check this in your lab?
Hi @downtownle
Last night I looped check_redfish.py 1000 times with the --all argument and was unable to reproduce the issue.
On a server that reproduces the issue, can you send me the value returned by "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count"?
Thanks Tex
Hello Tex,
I added the command and parameters below. It is a 14G R740xd system with iDRAC9, firmware version 7.00.00.171.
The command being executed: '/usr/lib64/nagios/plugins/check_redfish.py' '--host' '192.168.0.120' '--mel' '--password' 'geblurrt' '--retries' '5' '--timeout' '120' '--username' 'icinga'
__name | "check_redfish_mel" |
---|---|
active | true |
arguments | { --authfile: { description: "Autentication file content: username= |
command | [ "/usr/lib64/nagios/plugins/check_redfish.py" ] |
env | null |
execute | { arguments: [ "checkable", "cr", "resolvedMacros", "useResolvedMacros" ], deprecated: false, name: "Internal#PluginCheck", side_effect_free: false, type: "Function" } |
ha_mode | 0 |
name | "check_redfish_mel" |
original_attributes | null |
package | "director" |
paused | false |
timeout | 60 |
type | "CheckCommand" |
vars | { check_address: { arguments: [], deprecated: false, name: " |
version | 0 |
zone | "director-global" |
Thanks, but can you share the members count (Members@odata.count) for the LC logs on your server?
Hello Tex,
What exactly do you mean by "the members count for the LC logs on your server"?
Can you run GET on URI "redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count" to get this count value?
Example:
[root@localhost ~]# curl -k -X GET -u root:calvin -H "Content-Type: application/json" 'https://192.168.0.120/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries?$select=Members@odata.count' --insecure | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240  100   240    0     0    583      0 --:--:-- --:--:-- --:--:--   583
{
  "@odata.context": "/redfish/v1/$metadata#LogEntryCollection.LogEntryCollection",
  "@odata.id": "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Lclog/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Members@odata.count": 5389
}
Hello Tex,
I added the output.
Thanks for the details. On a server that reproduces the issue, can you just loop the check_redfish.py script from a terminal using a simple bash script and see if you can hit the issue? I would like to see if this workflow can trigger it (this is the workflow I used to try to reproduce it; my server has over 5000 LC log entries and I was unable to hit the issue).
Also, can you let me know whether only one Redfish session to this iDRAC is pulling data, or whether you are running multiple Redfish sessions to this iDRAC at the same time?
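If it helps, one quick way to see how many Redfish sessions are currently open on that iDRAC is to query the standard SessionService collection. The sketch below is just an illustration, not part of the plugin; IP and credentials are placeholders:

# Quick check of how many Redfish sessions are currently open on the iDRAC.
# Illustration only; IP and credentials are placeholders.
import requests
import urllib3

urllib3.disable_warnings()

IDRAC = "https://192.168.0.120"
AUTH = ("root", "calvin")

response = requests.get(
    f"{IDRAC}/redfish/v1/SessionService/Sessions",
    auth=AUTH, verify=False, timeout=30,
)
response.raise_for_status()
data = response.json()
print("Open sessions:", data.get("Members@odata.count", 0))
for member in data.get("Members", []):
    print(" ", member.get("@odata.id"))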
Thanks Tex
Hello Tex,
Do you have an example of what you mean by "can you just loop check_redfish.py script from a terminal using a simple bash script and see if you can hit this issue"?
Sure, the example below is a bash loop script I created that calls the Python script. I just append the output to a file and then grep the file for any warning or critical errors.
root@localhost:/opt/check_redfish# cat loop.sh
#!/bin/bash
# Initialize counter
counter=1
idrac_ip=$1
idrac_username=$2
idrac_password=$3
arg_name=$4
loop_count=$5
touch loop.txt
echo > loop.txt
# While loop
while [ $counter -le $loop_count ]
do
  python3 check_redfish.py -H $idrac_ip -u $idrac_username -p $idrac_password $arg_name
  echo "- Current loop Count: $counter"
  ((counter++))
done
echo "Loop script finished"
root@localhost:/opt/check_redfish# ./loop.sh 192.168.0.120 root calvin --all 2 >> loop.txt
root@localhost:/opt/check_redfish# cat loop.txt | grep -i warning
root@localhost:/opt/check_redfish# cat loop.txt | grep -i critical
root@localhost:/opt/check_redfish#
Hello Texas,
with iDRAC version 7.00.00.171 we see an increased number of traceback errors; it seems as if the iDRAC no longer responds properly. Here are two examples of how Icinga reacted to this:
File "/usr/lib64/nagios/plugins/check_redfish.py", line 178, in plugin.do_exit() File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 427, in do_exit print(self.return_output_data()) File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in return_output_data for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)): File "/usr/lib64/nagios/plugins/dtag/check_redfish/cr_module/classes/plugin.py", line 303, in for command in sorted(self.__output_data.get_commands(), key=lambda x: output_order.index(x)): ValueError: 'global' is not in list
And from another server:
{ "error": { "@Message.ExtendedInfo": [{ "Message": "The requested operation cannot be completed because of an internal error.", "MessageArgs": [], "MessageArgs@odata.count": 0, "MessageId": "IDRAC.2.8.SYS446", "RelatedProperties": [], "RelatedProperties@odata.count": 0, "Resolution": "Retry the operation after a few minutes. If the issue persists, contact your service provider.", "Severity": "Critical" }, { "Message": "The request failed due to an internal service error. The service is still operational.", "MessageArgs": [], "MessageArgs@odata.count": 0, "MessageId": "Base.1.12.InternalError", "RelatedProperties": [], "RelatedProperties@odata.count": 0, "Resolution": "Resubmit the request. If the problem persists, consider resetting the service.", "Severity": "Critical" } ], "code": "Base.1.12.GeneralError", "message": "A general error has occurred. See ExtendedInfo for more information" } }
I don't think it is a login problem (we don't get a 400 Bad Request); it looks more like a problem with keeping sessions open or cleanly ending the Redfish call. When I run closessn -a, i.e. manually end all sessions, it works again. To me this suggests that the iDRAC keeps the session ID but no longer passes the request data to that session ("The request failed due to an internal service error. The service is still operational."). Other checks with different session IDs continue to run smoothly. It may also have something to do with the size of the response ("The requested operation cannot be completed because of an internal error."), because the affected checks are usually those that return performance data or log entries (fan speed, memory utilization, MEL/SEL).
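For reference, the session cleanup I otherwise do with closessn -a could also be done over the Redfish API itself. The sketch below is only an illustration and assumes the standard Redfish SessionService endpoints; IP and credentials are placeholders:

# Rough sketch: closing leftover Redfish sessions over the API instead of
# running "closessn -a". Assumes the standard Redfish SessionService
# endpoints; IP and credentials are placeholders, use with care.
import requests
import urllib3

urllib3.disable_warnings()

IDRAC = "https://192.168.0.120"
AUTH = ("root", "calvin")

# List the currently open sessions
sessions = requests.get(
    f"{IDRAC}/redfish/v1/SessionService/Sessions",
    auth=AUTH, verify=False, timeout=30,
).json()

for member in sessions.get("Members", []):
    uri = member["@odata.id"]
    # DELETE on a session member closes that session
    result = requests.delete(f"{IDRAC}{uri}", auth=AUTH, verify=False, timeout=30)
    print(f"Closed {uri}: HTTP {result.status_code}")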