This matches the stuff we need to do for #833 pretty well.
Actually, retention should trigger a new index ranges rebuild. That might take too long, though, because it recalculates the ranges for all indices.
Right, the retention strategy kicks off and starts the rebuild index ranges job. The problem occurs when the rebuild index ranges job from the deflector cycle is still running: the rebuild index ranges job started by the retention strategy (which would process the appropriate number of indices and exclude the closed/deleted index) quits without running:
ERROR: org.graylog2.rest.resources.system.IndexRangesResource - Concurrency level of this job reached: The maximum of parallel [org.graylog2.indexer.ranges.RebuildIndexRangesJob] is locked to <1> but <1> are running.
Maybe a queue of org.graylog2.indexer.ranges.RebuildIndexRangesJob rather than just quitting?
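A minimal sketch of that queueing idea, using a plain java.util.concurrent executor rather than Graylog's actual job manager (the class and method names here are hypothetical): a single-threaded executor serializes rebuild jobs, so a second submission waits in line instead of being rejected.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: instead of rejecting a RebuildIndexRangesJob when one
// is already running, enqueue it on a single-threaded executor. The executor's
// internal queue guarantees at most one rebuild runs at a time, while later
// submissions wait their turn instead of quitting.
public class SerializedJobRunner {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public Future<?> submitRebuild(Runnable rebuildIndexRangesJob) {
        return executor.submit(rebuildIndexRangesJob);
    }
}
```

A refinement would be to collapse duplicate queued submissions into one, since two back-to-back full rebuilds do identical work.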
Ah, ok.
Yeah, we should probably change the process a little and only reprocess the indices we know have changed.
I'll try to get that done next week.
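A rough sketch of that incremental approach, with hypothetical names (not the actual Graylog implementation): drop ranges for indices that no longer exist, and calculate ranges only for indices that are new since the last run.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an incremental rebuild: only recalculate ranges for
// indices we have not seen yet, and drop ranges for indices that no longer
// exist (e.g. closed/deleted by retention).
public class IncrementalRangeRebuild {
    public void rebuild(Set<String> existingIndices, Map<String, IndexRange> knownRanges) {
        // Drop stale ranges for indices retention has closed or deleted.
        knownRanges.keySet().retainAll(existingIndices);

        // Calculate ranges only for indices without a stored range.
        for (String index : existingIndices) {
            if (!knownRanges.containsKey(index)) {
                knownRanges.put(index, calculateRange(index));
            }
        }
    }

    private IndexRange calculateRange(String index) {
        // Placeholder: in Graylog this would run the min/max timestamp
        // aggregation against the index.
        return new IndexRange();
    }

    // Placeholder standing in for org.graylog2.indexer.ranges.IndexRange.
    static class IndexRange {}
}
```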
Thanks!
Pushing to 1.1 because this means more work.
This should be fixed in Graylog 1.2.0. Please re-open the ticket if you still have this problem with Graylog 1.2.0.
I just encountered this on 1.3.3:
2016-02-01T18:47:58.135Z INFO [IndexRotationThread] Deflector index <foobar_223> should be rotated, Pointing deflector to new index now!
2016-02-01T18:47:58.135Z INFO [Deflector] Cycling deflector to next index now.
2016-02-01T18:47:58.160Z INFO [Deflector] Cycling from <foobar_223> to <foobar_224>
2016-02-01T18:47:58.160Z INFO [Deflector] Creating index target <foobar_224>...
2016-02-01T18:47:58.181Z ERROR [IndexRotationThread] Couldn't point deflector to a new index
org.elasticsearch.indices.IndexAlreadyExistsException: [foobar_224] already exists
I tried manually deleting index 224 on the Elasticsearch side with Graylog stopped, but it got right back into this state when I started it again.
I just encountered this on 1.3.3:
2016-03-23T15:05:27.651Z ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.
repeated a few times
2016-03-23T15:06:00.246Z ERROR [AlertScannerThread] Skipping alert check that threw an exception. org.elasticsearch.ElasticsearchTimeoutException: Timeout waiting for task.
and then
2016-03-23T15:06:21.007Z INFO [IndexRotationThread] Deflector index
Graylog v0.91.3
After the deflector index is cycled, a rebuild index ranges job kicks off.
Occasionally, while that job is still running, the retention strategy will kick off because we have more indices than our configured maximum (every 5 minutes, according to periodical/IndexRetentionThread.java?). The rebuild index ranges portion of the retention strategy will then fail because the job's maximum concurrency has been reached.
Now the first rebuild index ranges job finishes, and I have ranges reported for 21 indices. But I only have 20 indices, since retention closed/deleted my oldest one. Any "Search in all messages" search will now fail because it tries to search a closed or deleted index, until the next rebuild index ranges job kicks off and gets it right (which may be hours or days later).
As a workaround, I am currently monitoring the logs and kicking off a rebuild index ranges job via the API whenever I see this collision.
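For anyone scripting that workaround, here is a hedged sketch of triggering the rebuild from Java. The endpoint path and default REST port are from the 1.x API browser (POST /system/indices/ranges/rebuild on port 12900) and may differ in other versions; the host and credentials are placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hedged sketch: trigger a full index range rebuild through the Graylog REST
// API. Host, port, and credentials below are placeholders; verify the endpoint
// in your version's API browser before relying on it.
public class TriggerRangeRebuild {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://graylog.example.com:12900/system/indices/ranges/rebuild");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");

        // Basic auth with an admin user (placeholder credentials).
        String auth = Base64.getEncoder()
                .encodeToString("admin:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // A 202 Accepted response means the rebuild job was queued.
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```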