Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Race Condition between Cycle Deflector & Index Retention for Rebuild Index Ranges #779

Closed NickMeves closed 9 years ago

NickMeves commented 9 years ago

v0.91.3

After the deflector index is cycled, a rebuild index ranges job kicks off.

Occasionally while that job is still running, the retention strategy will kick off since we have more indices than our max (this is every 5 minutes according to periodical/IndexRetentionThread.java?). The rebuild index ranges job portion of the retention strategy will fail due to maximum concurrency of the job being reached.

Now the first rebuild index ranges job finishes and reports ranges for 21 indices, but I only have 20 indices because retention closed/deleted my oldest. Any "Search in all messages" search then fails, because it tries to search a closed or deleted index, until the next rebuild index ranges job kicks off and gets it right (which may take a few hours or days).

As a workaround I am currently monitoring the logs and kicking off a rebuild ranges job via the API if I see this collision.
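A minimal sketch of that workaround, assuming Java 11+ for `java.net.http`. The class and method names are hypothetical, the marker strings are copied from the error message quoted in this thread, and the `/system/indices/ranges/rebuild` path is an assumption about the REST API that should be checked against the API browser of the running version. Authentication is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for the log-watching workaround; not part of Graylog. */
final class RebuildRangesWorkaround {
    // Fragment of the error logged when a rebuild index ranges job is rejected.
    static final String COLLISION_MARKER =
        "The maximum of parallel [org.graylog2.indexer.ranges.RebuildIndexRangesJob]";

    /** Returns true if the log line reports a rejected rebuild index ranges job. */
    static boolean isCollision(String logLine) {
        return logLine != null
            && logLine.contains("Concurrency level of this job reached")
            && logLine.contains(COLLISION_MARKER);
    }

    /** Builds, but does not send, the POST that requests a full rebuild of the index ranges. */
    static HttpRequest rebuildRequest(String apiBaseUrl) {
        // Strip one trailing slash so the joined URL has no double slash.
        String base = apiBaseUrl.endsWith("/")
            ? apiBaseUrl.substring(0, apiBaseUrl.length() - 1)
            : apiBaseUrl;
        // Assumption: the rebuild endpoint path below matches the server's REST API.
        return HttpRequest.newBuilder(URI.create(base + "/system/indices/ranges/rebuild"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }
}
```

The request should only be sent once the first rebuild job has finished; otherwise it would run into the same concurrency limit.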

kroepke commented 9 years ago

matches the stuff we need to do for #833 pretty well

kroepke commented 9 years ago

Actually the retention should trigger a new index ranges rebuild. That might take too long, though, because it actually recalculates it for all indices.

NickMeves commented 9 years ago

Right, the retention strategy kicks off and starts the rebuild index ranges job. The problem occurs when the rebuild index ranges job from when the deflector cycled is still running; the rebuild index range job from the retention strategy job (which will process the appropriate number of indices and not include the closed/deleted index) will quit without running:

ERROR: org.graylog2.rest.resources.system.IndexRangesResource - Concurrency level of this job reached: The maximum of parallel [org.graylog2.indexer.ranges.RebuildIndexRangesJob] is locked to <1> but <1> are running.

Maybe a queue of org.graylog2.indexer.ranges.RebuildIndexRangesJob rather than just quitting?
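To illustrate the idea, here is a rough sketch of a runner that remembers a request that arrives while a rebuild is active and runs one follow-up rebuild afterwards, instead of rejecting it. `CoalescingRebuildRunner` is a hypothetical class, not Graylog's SystemJobManager, and it collapses any number of waiting requests into a single rerun rather than keeping a real queue.

```java
/** Hypothetical single-slot runner: requests made during a run trigger one follow-up run. */
final class CoalescingRebuildRunner {
    private final Runnable rebuild;
    private boolean running = false;
    private boolean rerunRequested = false;

    CoalescingRebuildRunner(Runnable rebuild) {
        this.rebuild = rebuild;
    }

    /** Runs the rebuild now, or records that one more run is needed if one is active. */
    void request() {
        synchronized (this) {
            if (running) {
                rerunRequested = true; // remember the request instead of failing
                return;
            }
            running = true;
        }
        try {
            while (true) {
                rebuild.run(); // runs outside the lock so other callers are not blocked
                synchronized (this) {
                    if (!rerunRequested) {
                        running = false;
                        return;
                    }
                    rerunRequested = false; // a request arrived meanwhile: run once more
                }
            }
        } catch (RuntimeException e) {
            // Release the slot if the rebuild itself fails.
            synchronized (this) {
                running = false;
                rerunRequested = false;
            }
            throw e;
        }
    }
}
```

With this shape, the rebuild requested by the retention strategy would run right after the one started by the deflector cycle, so the final ranges would reflect the deleted index.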

kroepke commented 9 years ago

Ah, ok.

Yeah we should probably change the process a little and only reprocess those we know have changed.
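A small sketch of what "only reprocess those we know have changed" could look like, under the assumption that index creation and deletion are reported as separate events. All names here are hypothetical, and the single timestamp per index is a stand-in for a real range record; this is not the code that later shipped.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical incremental range store: only indices that changed are touched. */
final class IncrementalIndexRanges {
    // Index name -> earliest message timestamp (epoch millis), standing in for a range.
    private final Map<String, Long> rangeStartByIndex = new TreeMap<>();
    // Assumed to be the expensive per-index calculation against the search backend.
    private final Function<String, Long> calculateStart;

    IncrementalIndexRanges(Function<String, Long> calculateStart) {
        this.calculateStart = calculateStart;
    }

    /** Calculates the range for one new or rotated index only. */
    synchronized void onIndexCreatedOrRotated(String index) {
        rangeStartByIndex.put(index, calculateStart.apply(index));
    }

    /** Drops the range of a closed or deleted index without recalculating the others. */
    synchronized void onIndexClosedOrDeleted(String index) {
        rangeStartByIndex.remove(index);
    }

    /** Returns a snapshot of the indices that currently have a range. */
    synchronized Set<String> searchableIndices() {
        return new TreeSet<>(rangeStartByIndex.keySet());
    }
}
```

Because a deletion only removes one entry, retention can no longer leave a stale range behind while a full rebuild is still in progress.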

I'll try to get that done next week.

Thanks!

kroepke commented 9 years ago

pushing to 1.1 because this means more work

joschi commented 9 years ago

This should be fixed in Graylog 1.2.0. Please re-open the ticket if you still have this problem with Graylog 1.2.0.

cralston0 commented 8 years ago

I just encountered this on 1.3.3

2016-02-01T18:47:58.135Z INFO  [IndexRotationThread] Deflector index <foobar_223> should be rotated, Pointing deflector to new index now!
2016-02-01T18:47:58.135Z INFO  [Deflector] Cycling deflector to next index now.
2016-02-01T18:47:58.160Z INFO  [Deflector] Cycling from <foobar_223> to <foobar_224>
2016-02-01T18:47:58.160Z INFO  [Deflector] Creating index target <foobar_224>...
2016-02-01T18:47:58.181Z ERROR [IndexRotationThread] Couldn't point deflector to a new index
org.elasticsearch.indices.IndexAlreadyExistsException: [foobar_224] already exists

Tried manually deleting index 224 on the ES side with GL stopped and it got right back into this state when I started it again.

ghost commented 8 years ago

I just encountered this on 1.3.3.

2016-03-23T15:05:27.651Z ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource
org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.

repeated a few times

2016-03-23T15:06:00.246Z ERROR [AlertScannerThread] Skipping alert check that threw an exception.
org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.

and then

2016-03-23T15:06:21.007Z INFO [IndexRotationThread] Deflector index should be rotated, Pointing deflector to new index now!
2016-03-23T15:06:21.007Z INFO [Deflector] Cycling deflector to next index now.
2016-03-23T15:06:21.133Z INFO [Deflector] Cycling from to
2016-03-23T15:06:21.133Z INFO [Deflector] Creating index target ...
2016-03-23T15:06:22.659Z INFO [Deflector] Waiting for index allocation of
2016-03-23T15:06:23.770Z INFO [Deflector] Done!
2016-03-23T15:06:23.770Z INFO [Deflector] Pointing deflector to new target index....
2016-03-23T15:06:25.278Z INFO [SystemJobManager] Submitted SystemJob [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-03-23T15:06:25.278Z INFO [CreateNewSingleIndexRangeJob] Calculating ranges for index odop_911.
2016-03-23T15:06:25.278Z INFO [SystemJobManager] Submitted SystemJob [org.graylog2.indexer.SetIndexReadOnlyJob]
2016-03-23T15:06:25.278Z INFO [SystemJobManager] Submitted SystemJob [org.graylog2.indexer.ranges.CreateNewSingleIndexRangeJob]
2016-03-23T15:06:25.278Z INFO [Deflector] Done!
2016-03-23T15:06:25.278Z INFO [CreateNewSingleIndexRangeJob] Calculating ranges for index odop_912.
2016-03-23T15:06:38.314Z WARN [IndexHelper] Couldn't find latest deflector target index
org.graylog2.database.NotFoundException: Index range for index not found.