Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

LeaderPresenceCheckPeriodical is too sensitive, should allow for momentary blips #20672

Open drewmiranda-gl opened 1 week ago

drewmiranda-gl commented 1 week ago

Graylog's use of LeaderPresenceCheckPeriodical is very aggressive. Even a momentary blip in which Graylog cannot communicate with its Mongo cluster (for example, if the Graylog server is unable to resolve the hostname configured in Graylog's Mongo URI) causes Graylog to raise a NO_LEADER system alert.

It looks like there is already code that allows for leniency with these checks, but it only applies when using automatic leader election.

Additionally, Graylog does not log any further information about this issue, which makes it very difficult (if not impossible) to understand what is happening or how to resolve or prevent it.

Expected Behavior

Current Behavior

The NO_LEADER system alert is far too sensitive and triggers on momentary blips.

No further logging or information is provided about why this happened.

Possible Solution

Allow a grace period even when NOT using automatic leader election.

Log what is happening when the leader check fails.
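
To make the two suggestions above concrete, here is a minimal sketch of the logic I have in mind. It is not Graylog's code (Graylog is Java, and I have not mirrored its classes). The names `LeaderPresenceCheck`, `leader_present`, and `grace_seconds`, and the 30 second default, are all hypothetical and only illustrate "tolerate a short outage, and log each step".

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("leader_presence_sketch")


class LeaderPresenceCheck:
    """Hypothetical check: report NO_LEADER only after a grace period has elapsed."""

    def __init__(
        self,
        leader_present: Callable[[], bool],  # assumed lookup, e.g. a query against MongoDB
        grace_seconds: float = 30.0,         # illustrative default, not a Graylog setting
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._leader_present = leader_present
        self._grace_seconds = grace_seconds
        self._clock = clock
        self._missing_since: Optional[float] = None

    def run_once(self) -> bool:
        """Run one periodic check; return True if NO_LEADER should be raised now."""
        try:
            present = self._leader_present()
        except Exception as exc:
            # A failed lookup (DNS error, connection reset, ...) is logged, not swallowed.
            log.warning("Leader lookup failed: %s", exc)
            present = False

        now = self._clock()
        if present:
            if self._missing_since is not None:
                log.info("Leader visible again after %.1fs", now - self._missing_since)
            self._missing_since = None
            return False

        if self._missing_since is None:
            self._missing_since = now  # start of the outage
        missing_for = now - self._missing_since
        if missing_for < self._grace_seconds:
            log.warning(
                "No leader seen for %.1fs (grace period %.1fs), not alerting yet",
                missing_for, self._grace_seconds,
            )
            return False

        log.error("No leader seen for %.1fs, raising NO_LEADER", missing_for)
        return True
```

With this shape, a blip shorter than the grace period only produces warning lines in the log, while a sustained outage still raises the alert once the grace period is exceeded.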

Steps to Reproduce (for bugs)

I can reproduce this as follows:

Using Ubuntu Server 22.04 LTS

  1. Configure Graylog to use a .local hostname in the Mongo URI
  2. Stop the DNS resolver that resolves that hostname
  3. Wait a couple of seconds, then restart the DNS resolver
  4. Observe that Graylog fires a NO_LEADER system notification

Context

I uncovered this after enabling notifications for Graylog's "System notification events" event definition, which then triggered repeatedly. I then correlated each occurrence with a restart of the DNS resolver on my pfSense router.

Your Environment

Please let me know if there are any questions

drewmiranda-gl commented 1 week ago

I'm working on properly reproducing this, and now I'm not sure the issue has anything to do with DNS. It seems Ubuntu can continue to resolve the .local hostnames regardless of whether the upstream DNS resolver is running or stopped. I'm not sure why stopping and starting it would cause a downstream blip, unless this is a weird edge case or bug in Ubuntu's systemd-resolved.
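
One way to check whether resolution really blips during a resolver restart is to poll the hostname once a second and print a timestamped result. The sketch below does that through the OS resolver with `socket.getaddrinfo`. The function names `probe` and `watch` are my own, and 27017 is just the default MongoDB port. Note that this exercises Python's resolution path, which may not behave identically to the JVM's, so treat the output as a hint rather than proof.

```python
import socket
import time
from datetime import datetime, timezone
from typing import Callable, Tuple


def probe(
    hostname: str,
    port: int = 27017,  # default MongoDB port; adjust to match the Mongo URI
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> Tuple[bool, str]:
    """Resolve hostname once; return (ok, addresses or error text)."""
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"{type(exc).__name__}: {exc}"
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in results})
    return True, ", ".join(addresses)


def watch(hostname: str, interval_seconds: float = 1.0, iterations: int = 60) -> None:
    """Print one timestamped line per probe so blips can be lined up with server.log."""
    for _ in range(iterations):
        ok, detail = probe(hostname)
        stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        status = "OK" if ok else "FAIL"
        print(f"{stamp} {status} {hostname} {detail}")
        time.sleep(interval_seconds)
```

Running `watch("<hostname from the Mongo URI>")` on the Graylog host while restarting the resolver should show whether any FAIL lines coincide with the NO_LEADER notification.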

If anyone has ideas on how to enable more verbose logging in Graylog to troubleshoot this, let me know. I attempted to enable several loggers but could not get any output in server.log.