Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Race condition on Index Cycle (manual or by retention config) with IndexBlockCheck #19571

Open HenryTheSir opened 4 months ago

HenryTheSir commented 4 months ago

Expected Behavior

Errors shown under System/Overview are reliable.

Current Behavior

Cycling an index in an environment with hundreds of index sets leads to an "Indices blocked" error under System/Overview for the old index. It seems that the check (https://github.com/Graylog2/graylog2-server/blob/5.2.8/graylog2-server/src/main/java/org/graylog2/periodical/IndexBlockCheck.java) checks the old index rather than the new one.
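For illustration, a minimal sketch of the racy pattern being described. The interfaces below are simplified stand-ins, not Graylog's actual classes:

```java
import java.util.Set;

class IndexBlockCheckSketch {
    // Assumed shapes for illustration only; Graylog's real types differ.
    interface IndexSetRegistry { Set<String> getWriteIndexNames(); }
    interface BlockService { boolean isBlocked(String indexName); }

    private final IndexSetRegistry registry;
    private final BlockService blocks;

    IndexBlockCheckSketch(IndexSetRegistry registry, BlockService blocks) {
        this.registry = registry;
        this.blocks = blocks;
    }

    void doRun() {
        // 1. Snapshot all current write index names in one batch.
        final Set<String> writeIndices = registry.getWriteIndexNames();

        // 2. By the time the loop reaches a given name, a manual or
        //    retention-triggered cycle may already have moved the write alias
        //    to a new index, so "name" can be the old, legitimately blocked
        //    index -> false-positive "Indices blocked" notification.
        for (String name : writeIndices) {
            if (blocks.isBlocked(name)) {
                publishBlockedNotification(name);
            }
        }
    }

    private void publishBlockedNotification(String name) {
        System.out.println("Index blocked (possibly stale): " + name);
    }
}
```

With hundreds of index sets, the window between the snapshot and a given iteration of the loop is wide enough for a cycle to land inside it.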

Possible Solution

https://github.com/Graylog2/graylog2-server/blob/5.2.8/graylog2-server/src/main/java/org/graylog2/periodical/IndexBlockCheck.java#L59

The check currently fetches all write indices up front and afterwards loops over all of them. A better approach could be to loop over the index sets and, in the process, fetch each set's current write index and check it immediately, or just check the write deflector. This could save round trips to MongoDB, since for every index set only the _deflector alias needs to be checked. A sketch of this is below.
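A minimal sketch of that suggestion, again with simplified stand-in interfaces (Graylog's real IndexSet/IndexSetRegistry APIs differ): resolve each index set's active write target immediately before checking it.

```java
import java.util.List;
import java.util.Optional;

class DeflectorAwareBlockCheck {
    // Assumed shapes for illustration only.
    interface IndexSet { Optional<String> getActiveWriteIndex(); }
    interface IndexSetRegistry { List<IndexSet> getAll(); }
    interface BlockService { boolean isBlocked(String indexName); }

    private final IndexSetRegistry registry;
    private final BlockService blocks;

    DeflectorAwareBlockCheck(IndexSetRegistry registry, BlockService blocks) {
        this.registry = registry;
        this.blocks = blocks;
    }

    void doRun() {
        for (IndexSet indexSet : registry.getAll()) {
            // Resolve the current write target (the "_deflector" write alias)
            // immediately before checking it, so an index cycle that happened
            // after the loop started cannot leave us holding a stale name.
            indexSet.getActiveWriteIndex()
                    .filter(blocks::isBlocked)
                    .ifPresent(this::publishBlockedNotification);
        }
    }

    private void publishBlockedNotification(String index) {
        System.out.println("Write index blocked: " + index);
    }
}
```

The key design point is that the name resolution and the block check happen in the same iteration, so a cycle can no longer slip between them for a given index set.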

Also, it would be nice if this job did not run every 30 seconds, as it puts unneeded pressure on the leader and can lead to a flapping leader (a "No leader in cluster" notification appears, and shortly afterwards the leader reappears).
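As a rough illustration of one direction, a sketch of a Periodical whose interval is injected instead of hard-coded. The Periodical base class and its getPeriodSeconds() hook are real extension points in graylog2-server, but the class below and the idea of a configurable interval setting are hypothetical, and the exact set of overridable methods varies slightly between versions:

```java
import org.graylog2.plugin.periodical.Periodical;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConfigurableIndexBlockCheck extends Periodical {
    private static final Logger LOG = LoggerFactory.getLogger(ConfigurableIndexBlockCheck.class);

    // Hypothetical: would come from a new server.conf key such as
    // index_block_check_interval_seconds (no such key exists today).
    private final int intervalSeconds;

    public ConfigurableIndexBlockCheck(int intervalSeconds) {
        this.intervalSeconds = intervalSeconds;
    }

    @Override
    public boolean runsForever() { return false; }

    @Override
    public boolean stopOnGracefulShutdown() { return true; }

    @Override
    public boolean leaderOnly() { return true; } // was masterOnly() before the leader rename

    @Override
    public boolean startOnThisNode() { return true; }

    @Override
    public boolean isDaemon() { return true; }

    @Override
    public int getInitialDelaySeconds() { return 0; }

    @Override
    public int getPeriodSeconds() {
        return intervalSeconds; // e.g. 300 instead of the hard-coded 30
    }

    @Override
    protected Logger getLogger() { return LOG; }

    @Override
    public void doRun() {
        // the block-check logic itself would live here
    }
}
```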

Steps to Reproduce (for bugs)

  1. Create a large number of index sets (maybe 500?).
  2. Cycle an index (see the sketch after this list).
  3. Observe occasional false-positive alerts for blocked indices.
  4. In the next run of the job the alert gets cleared.
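For step 2, one hedged way to trigger a cycle programmatically via the REST API. The /system/deflector/{indexSetId}/cycle path matches the deflector resource in graylog2-server, but verify it against your version's API browser; host and credentials here are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CycleIndexSet {
    public static void main(String[] args) throws Exception {
        String host = "http://graylog.example.org:9000";  // placeholder
        String indexSetId = args[0];                      // MongoDB id of the index set
        String auth = Base64.getEncoder()
                .encodeToString("admin:password".getBytes(StandardCharsets.UTF_8)); // placeholder

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(host + "/api/system/deflector/" + indexSetId + "/cycle"))
                .header("Authorization", "Basic " + auth)
                // Graylog rejects state-changing API calls without this CSRF header.
                .header("X-Requested-By", "cycle-sketch")
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}
```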

HenryTheSir commented 4 months ago

Optimizations here could be adapted directly for this issue: https://github.com/Graylog2/graylog2-server/issues/18563

tellistone commented 4 months ago

Hello Henry, thanks for raising this!

Just to note, a flapping leader usually indicates that the HTTP thread pool on the leader node is being exhausted.

We observe this on:

  * clusters with many hundreds of Sidecars phoning home frequently, on top of other performance issues (the impact can be reduced by increasing the Sidecars' phone-home interval)
  * clusters where MongoDB sits on low-IOPS storage (fewer than 3000 IOPS)
  * clusters with alerts set to run very frequently (every x seconds)
  * clusters where, in server.conf, the sum of the values for the config keys below is too high:

`inputbuffer_processors`, `processbuffer_processors`, `outputbuffer_processors`
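For illustration, a hedged example of what this looks like in server.conf. The values below match the defaults shipped in Graylog's example config; the point of the advice above is that their sum should be sized to the node's CPU count rather than raised indiscriminately:

```properties
# Illustrative values only; size these so the sum stays at or below
# the number of cores available to the Graylog node.
inputbuffer_processors = 2
processbuffer_processors = 5
outputbuffer_processors = 3
```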