Closed pwloc closed 5 months ago
If the script returns a non-zero exit code (and the checker does not have misc_dynamic enabled), the last_status is only updated if the previous exit code was 0 (i.e. success). In other words, for a non-dynamic checker the "Last status" is only updated when the checker changes status from success to failure or vice versa. The reason for this is that the exit code changing from 1 to 2, or 2 to 1, or 100 to 1, etc, is not changing the status of the checker, i.e. it remains in the failed state. The detail is in the _misc_check_childthread() function in the keepalived/check/check_misc.c source file.
keepalived is functioning correctly, other than the Last status
reported in the diagnostic output may be misleading.
If it is important for you that the Last status
field in the diagnostic output is more than just, in essence, either zero or non-zero, could you please explain why, and I will consider changing the code. If you are using SUGUSR1 to write keepalived_check.data in order to check the last exit code returned by your script, it would be much more efficient for the script to write the exit code it will return to a diagnostic file.
Thanks for explaining the behavior. What I'm trying to do is to monitor the state of keepalived's checks so I could know not only if but also why given real server is down. Misc check is not my only check. I also use BFD check. Therefore I prefer to get all checks status from one place that is from keepalived. Moreover, this way I get real checks' state not the state of diagnostic files. I agree that check status (up/down) is more important but exit code provides some extra troubleshooting info, at least in my case. Last status is present in the state file anyway so having it updated like "Last ran" is would be great.
Btw. have you considered including checks' statuses (up/down at least) in the SNMP MIB ? Dumping and parsing state files doesn't seem to be the best method for such use case and I guess is not intended for that purpose.
@pwloc I hadn't considered adding checker information to the SNMP MIB (I find adding extra SNMP code very tedious!). I wonder if it would be more appropriate to use traps to notify status changes. I would appreciate your thoughts on this.
If you or someone else would like to modify the code to add the extra SNMP functionality and submit a pull request, it would be very helpful.
I understand. SNMP is not easy protocol to work with. As for SNMP traps they don't fit my architecture. I need a place to query for checks status. I would have to add traps listener to hold the state. Prefer to stick with data file.
As for implementing SNMP features, I'm affraid that you overestimate my C skills. I'll see If I could find someone else but I find chances slim.
Commit a4258a6 now updates the exit code if it changes, even if the checker is already down.
Describe the bug I've got virtual server configured and real_server has the following misc check configured. The check returns 0 or 1 that can be found in the file or 2 if the file is not present.
Whenever exit code changes between 2 and 1 (or greater value) the change is not present in keepalived_check.data. Of course data file is dumped again. "Last ran" gets updated but "Last status" not
When exit code changes from 0 to 1 or 2 (or vice versa) there is no problem.
To Reproduce
Expected behavior keepalived_check.data is updated every time check's exit code changes.
Keepalived version
Distro (please complete the following information):
Details of any containerisation or hosted service (e.g. AWS) Alpine running in docker container running on Ubuntu 22.04.2