Open HenryTheSir opened 4 months ago
Optimizations here could be adapted directly for this issue: https://github.com/Graylog2/graylog2-server/issues/18563
Hello Henry, thanks for raising this!
Just to note, a flapping leader usually indicates that the HTTP thread pool on the leader is being exhausted.
We see this on clusters with many hundreds of Sidecars that phone home frequently on top of other performance issues (increasing the Sidecar phone-home interval reduces the impact), on clusters whose MongoDB runs on low-IOPS storage (less than 3000 IOPS), on clusters with alerts set to run very frequently (every x seconds), and on clusters where the sum of the following server.conf keys is too high:
inputbuffer_processors processbuffer_processors outputbuffer_processors
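For anyone who wants to check that sum quickly, here is a minimal Python sketch that reads the three keys from the text of a server.conf and adds them up. The function name `processor_sum` and the sample values are made up for illustration, and the sketch does not decide what counts as "too high"; compare the result against whatever budget applies to your nodes.

```python
# Keys named in the comment above; everything else in this block is illustrative.
PROCESSOR_KEYS = (
    "inputbuffer_processors",
    "processbuffer_processors",
    "outputbuffer_processors",
)


def processor_sum(conf_text):
    """Return (total, per_key) for the processor keys found in conf_text.

    Only simple "key = value" lines are handled; blank lines and lines
    starting with '#' are skipped. Keys that are absent are not counted.
    """
    values = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in PROCESSOR_KEYS:
            values[key] = int(value.strip())
    return sum(values.values()), values


if __name__ == "__main__":
    # Hypothetical sample values, not a recommendation.
    sample = """
    inputbuffer_processors = 2
    processbuffer_processors = 5
    outputbuffer_processors = 3
    """
    total, per_key = processor_sum(sample)
    print(total, per_key)
```

To use it against a real node, pass the contents of that node's server.conf to `processor_sum`.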
Expected Behavior
Errors under System/Overview are reliable
Current Behavior
Cycling an index in an environment with hundreds of index sets leads to an Indices blocked error under System/Overview for the old index. It seems that the check (https://github.com/Graylog2/graylog2-server/blob/5.2.8/graylog2-server/src/main/java/org/graylog2/periodical/IndexBlockCheck.java) checks the old index rather than the new one.
Possible Solution
https://github.com/Graylog2/graylog2-server/blob/5.2.8/graylog2-server/src/main/java/org/graylog2/periodical/IndexBlockCheck.java#L59
The check currently fetches all write indices and afterwards loops over all of them. A better approach could be to loop over all index sets and, for each one, fetch the current write index and check it immediately, or just check the write deflector. This could save round trips to MongoDB, since for every index set only the _deflector index needs to be checked.
It would also be nice if this job did not run every 30 seconds, as it puts unneeded pressure on the leader and can lead to a flapping leader (No Leader in Cluster; shortly after this notification the leader reappears).
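The per-index-set deflector check proposed above could look roughly like the Python sketch below. It is not Graylog's implementation. It assumes that the block of interest is `index.blocks.read_only_allow_delete` and that settings come back in the nested shape of the Elasticsearch/OpenSearch get-settings response (`{index_name: {"settings": {"index": {...}}}}`). `fetch_settings` is a hypothetical callable standing in for a `GET /<alias>/_settings` request, and the function names are made up for illustration.

```python
def deflector_alias(index_prefix):
    """Name of the write deflector for an index set, e.g. graylog_deflector."""
    return f"{index_prefix}_deflector"


def blocked_indices(settings_response):
    """Return the sorted names of indices flagged read_only_allow_delete.

    settings_response is assumed to be keyed by concrete index name, which
    is how a settings lookup through an alias reports its target.
    """
    blocked = []
    for index_name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        flag = str(blocks.get("read_only_allow_delete", "false")).lower()
        if flag == "true":
            blocked.append(index_name)
    return sorted(blocked)


def check_index_sets(prefixes, fetch_settings):
    """Check only the deflector of each index set, one lookup per set.

    fetch_settings(alias) -> dict is a hypothetical callable supplied by
    the caller. Returns {prefix: [blocked index names]} for affected sets.
    """
    result = {}
    for prefix in prefixes:
        hits = blocked_indices(fetch_settings(deflector_alias(prefix)))
        if hits:
            result[prefix] = hits
    return result
```

Because the lookup goes through the deflector, the answer always refers to whatever index is currently being written to, so an index that was just cycled out would not be reported.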
Steps to Reproduce (for bugs)
Context
Your Environment