Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org
Other
7.37k stars 1.06k forks source link

Race condition in SetIndexReadOnlyAndCalculateRangeJob #5613

Open mpfz0r opened 5 years ago

mpfz0r commented 5 years ago

A quickly consecutive execution of rotating the active write index and deleting the old index triggers this exception:

2019-01-29 14:05:16,904 ERROR: org.graylog2.system.jobs.SystemJobManager - Unhandled error while running SystemJob <740cbf40-23c6-11e9-8d27-4292641f89fe> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob]
java.lang.IllegalArgumentException: Cat response did not contain a JSON Array
    at io.searchbox.core.Cat.parseResponseBody(Cat.java:61) ~[jest-common-2.4.15+jackson.jar:?]
    at io.searchbox.action.AbstractAction.createNewElasticSearchResult(AbstractAction.java:71) ~[jest-common-2.4.15+jackson.jar:?]
    at io.searchbox.core.Cat.createNewElasticSearchResult(Cat.java:44) ~[jest-common-2.4.15+jackson.jar:?]
    at io.searchbox.core.Cat.createNewElasticSearchResult(Cat.java:16) ~[jest-common-2.4.15+jackson.jar:?]
    at io.searchbox.client.http.JestHttpClient.deserializeResponse(JestHttpClient.java:212) ~[jest-2.4.15+jackson.jar:?]
    at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:88) ~[jest-2.4.15+jackson.jar:?]
    at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:46) ~[classes/:?]
    at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:62) ~[classes/:?]
    at org.graylog2.indexer.indices.Indices.catIndices(Indices.java:581) ~[classes/:?]
    at org.graylog2.indexer.indices.Indices.getClosedIndices(Indices.java:520) ~[classes/:?]
    at org.graylog2.indexer.indices.Indices.isClosed(Indices.java:566) ~[classes/:?]
    at org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob.execute(SetIndexReadOnlyAndCalculateRangeJob.java:66) ~[classes/:?]
    at org.graylog2.system.jobs.SystemJobManager$1.run(SystemJobManager.java:89) [classes/:?]
    at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241) [metrics-core-4.0.3.jar:4.0.3]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_191]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_191]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]

Steps to Reproduce (for bugs)

  1. Go to an index set
  2. Click "Rotate active write index"
  3. Shortly thereafter delete one index from the set
bernd commented 5 years ago

Some context:

This happens because the SetIndexReadOnlyAndCalculateRangeJob starts with a delay of 30 seconds after the index has been rotated. (a workaround for an indexing issue) If the index gets deleted shortly after index rotation, the SetIndexReadOnlyAndCalculateRangeJob fails because the index is already gone and we cannot check if it's closed.

One possible solution is to check if the index exists before we check if it's closed in Indices#isClosed.

https://github.com/Graylog2/graylog2-server/blob/a8485efec2938c744061b50e5c75629d6d84b770/graylog2-server/src/main/java/org/graylog2/indexer/indices/Indices.java#L565-L567

This still leaves a window for a race condition of course. Another solution would be to use the regular API instead of the _cat API to do the closed check and handle errors properly. The _cat API error handling in the Jest ES client doesn't allow us to generate a good error here.