elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Watcher: Mysterious rolling upgrade failure #33185

Closed nik9000 closed 4 years ago

nik9000 commented 6 years ago

This rolling upgrade build failed fairly mysteriously. This is what the failure looks like:

15:24:46   1> [2018-08-26T15:24:37,459][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test] Stash dump on test failure [{
15:24:46   1>   "stash" : {
15:24:46   1>     "record_id" : "my_watch_0be15ae1-d0c1-4516-b2fc-7291a51c00a6-2018-08-26T15:24:37.281Z",
15:24:46   1>     "body" : {
15:24:46   1>       "_id" : "my_watch_0be15ae1-d0c1-4516-b2fc-7291a51c00a6-2018-08-26T15:24:37.281Z",
15:24:46   1>       "watch_record" : {
15:24:46   1>         "watch_id" : "my_watch",
15:24:46   1>         "node" : "j6O-Ii5MRcuZismjV2mJPA",
15:24:46   1>         "state" : "failed",
15:24:46   1>         "user" : "test_user",
15:24:46   1>         "status" : {
15:24:46   1>           "state" : {
15:24:46   1>             "active" : true,
15:24:46   1>             "timestamp" : "2018-08-26T15:24:36.992Z"
15:24:46   1>           },
15:24:46   1>           "actions" : {
15:24:46   1>             "logging" : {
15:24:46   1>               "ack" : {
15:24:46   1>                 "timestamp" : "2018-08-26T15:24:36.992Z",
15:24:46   1>                 "state" : "awaits_successful_execution"
15:24:46   1>               }
15:24:46   1>             }
15:24:46   1>           },
15:24:46   1>           "execution_state" : "failed",
15:24:46   1>           "version" : 1
15:24:46   1>         },
15:24:46   1>         "trigger_event" : {
15:24:46   1>           "type" : "manual",
15:24:46   1>           "triggered_time" : "2018-08-26T15:24:37.279Z",
15:24:46   1>           "manual" : {
15:24:46   1>             "schedule" : {
15:24:46   1>               "scheduled_time" : "2018-08-26T15:24:37.279Z"
15:24:46   1>             }
15:24:46   1>           }
15:24:46   1>         },
15:24:46   1>         "input" : {
15:24:46   1>           "simple" : { }
15:24:46   1>         },
15:24:46   1>         "condition" : {
15:24:46   1>           "always" : { }
15:24:46   1>         },
15:24:46   1>         "result" : {
15:24:46   1>           "execution_time" : "2018-08-26T15:24:37.281Z",
15:24:46   1>           "execution_duration" : 1535297077283,
15:24:46   1>           "actions" : [ ]
15:24:46   1>         },
15:24:46   1>         "exception" : {
15:24:46   1>           "type" : "illegal_state_exception",
15:24:46   1>           "reason" : "could not register execution [my_watch]. current executions are sealed and forbid registrations of additional executions."
15:24:46   1>         }
15:24:46   1>       }
15:24:46   1>     }
15:24:46   1>   }
15:24:46   1> }]
15:24:46   1> [2018-08-26T15:24:37,502][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test] [p0=old_cluster/60_watcher/CRUD watch APIs] after test
15:24:46 FAILURE 0.96s | UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/60_watcher/CRUD watch APIs} <<< FAILURES!
15:24:46    > Throwable #1: java.lang.AssertionError: Failure at [old_cluster/60_watcher:43]: watch_record.state didn't match expected value:
15:24:46    >             watch_record.state: expected [executed] but was [failed]

This is in the "old" cluster, so the cluster state is empty, but we still get an error as though the watch was running.

elasticmachine commented 6 years ago

Pinging @elastic/es-core-infra

polyfractal commented 6 years ago

Pasting from an old issue (in the old x-pack repo) with a similar error. I think the other one was a concurrency issue, but it was never resolved.

I tried a few approaches to fix this but wasn't satisfied with any of them. Jotting down notes for now:

The issue is that ExecutionService#clearExecutions() is called on stop or pause, which seals currentExecutions and then overwrites it with a new instance. It's possible for a manual execution of a watch to slip in between the sealing and the replacement, which throws the exception seen here.

There are a few ways this could be fixed; I'm not sure of the best route:

  • Instead of throwing an exception, CurrentExecutions#put() could return an Optional to signal whether it was able to add to the map; if not, the calling code could just log a message and abort (a rough sketch of this option follows the list)
  • Allow manually executed watches to be added to the currently executing list despite it being sealed
  • Modify CurrentExecutions so that it can be reset internally, rather than externally overwritten after being drained. This would change up the sealing dynamics and allow waiting for the drain to finish, etc.
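
As a rough illustration of the first option above, here is a minimal, hypothetical sketch of a registry whose put() reports failure instead of throwing once it has been sealed. The class, field, and method names are stand-ins only and do not mirror the real CurrentExecutions implementation:

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for CurrentExecutions; illustrative only.
final class SealableExecutions<T> {
    private final ConcurrentHashMap<String, T> executions = new ConcurrentHashMap<>();
    private final AtomicBoolean sealed = new AtomicBoolean(false);

    // Attempts to register an execution. Instead of throwing once the
    // registry is sealed, an empty Optional is returned so the caller can
    // log a message and abort gracefully. Note that this check-then-act is
    // not fully race-free on its own; a complete fix would still need some
    // extra synchronisation around sealing.
    Optional<T> put(String watchId, T execution) {
        if (sealed.get()) {
            return Optional.empty();
        }
        executions.put(watchId, execution);
        return Optional.of(execution);
    }

    // Refuse any further registrations.
    void seal() {
        sealed.set(true);
    }

    // Deregister a finished execution.
    void remove(String watchId) {
        executions.remove(watchId);
    }
}

The calling code in ExecutionService#execute() would then check for the empty Optional, log, and return early rather than letting an IllegalStateException bubble up into the watch record.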

In that issue, @DaveCTurner followed up with:

Would it work to do something like the following?

 diff --git a/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java b/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
 index 407ab9a7e5..c3f73740e8 100644
 --- a/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
 +++ b/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
 @@ -554,8 +554,9 @@ public class ExecutionService extends AbstractComponent {
       * This is needed, because when this method is called, watcher keeps running, so sealing executions would be a bad idea
      */
     public void clearExecutions() {
-        currentExecutions.sealAndAwaitEmpty(maxStopTimeout);
+        final CurrentExecutions previousExecutions = currentExecutions;
         currentExecutions = new CurrentExecutions();
+        previousExecutions.sealAndAwaitEmpty(maxStopTimeout);
     }

     // the watch execution task takes another runnable as parameter

NB I'm not sure this actually works as-is without further synchronisation.
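
For what it's worth, one way to read that NB: the reference swap itself could be made atomic, e.g. by holding the registry in an AtomicReference. This is only a sketch built on the hypothetical SealableExecutions type above, not the actual ExecutionService code, and as the comments note it still leaves a residual window:

import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: the registry is held in an AtomicReference so that the
// "install a fresh registry, then drain the old one" step from the diff
// above becomes a single atomic swap.
final class ExecutionRegistryHolder {
    private final AtomicReference<SealableExecutions<Object>> current =
            new AtomicReference<>(new SealableExecutions<>());

    void clearExecutions() {
        // Threads that fetch the registry after this swap register against
        // the fresh instance.
        SealableExecutions<Object> previous = current.getAndSet(new SealableExecutions<>());
        // Seal and drain the old registry; the real code would call
        // sealAndAwaitEmpty(maxStopTimeout) here.
        previous.seal();
        // Residual race: a thread that fetched the old reference just before
        // the swap can still hit the seal, which is presumably the further
        // synchronisation the NB above is about.
    }

    SealableExecutions<Object> current() {
        return current.get();
    }
}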

Not sure if it's the same issue (manual executions racing), but if there are concurrency issues with this code it might be manifesting in the above test too.

alpar-t commented 6 years ago

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/27/console

jkakavas commented 5 years ago

And another one today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2197/consoleText in master intake

This fails once in a while: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-6M,mode:quick,to:now))&_a=(columns:!(_source),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22old_cluster%2F60_watcher%2FCRUD%20watch%20APIs%22'),sort:!(process.time-start,desc))

davidkyle commented 5 years ago

Another instance in 7.0 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+artifactory/133/console

davidkyle commented 5 years ago

Another failure in 7.0: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+artifactory/136/testReport/junit/org.elasticsearch.upgrades/UpgradeClusterClientYamlTestSuiteIT/test__p0_old_cluster_60_watcher_CRUD_watch_APIs_/ so I backported the muting to 7.0 in commit 1948702f4ab3a7a2bfc243ca0328173cac4a487c.

There are only 2 watcher tests in the yml suite and both are muted, leaving only WatcherRestartIT for Watcher upgrade test coverage, and that test appears pretty minimal.

jakelandis commented 5 years ago

Un-muted this test on PR #42377 to obtain additional logs.

If (when?) this test fails again, please obtain the following information before muting the test:

jkakavas commented 5 years ago

This failed in a PR build:

Jenkins build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-bwc/6473/
Gradle scan: https://gradle.com/s/7smiqhxdoghp2
Repro:

./gradlew :x-pack:qa:rolling-upgrade:v7.3.0#oldClusterTestRunner --tests "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/60_watcher/CRUD watch APIs}" -Dtests.seed=8DFA7D80E04774A7 -Dtests.security.manager=true -Dtests.locale=ta-MY -Dtests.timezone=Asia/Krasnoyarsk -Dcompiler.java=12 -Druntime.java=11 -Dtests.rest.suite=old_cluster

Logs:
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/hprof-files.tar.gz
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/syserr-files.tar.gz
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/log-files.tar.gz

DaveCTurner commented 5 years ago

In today's master I found a bunch of tests that are being skipped and which link to this issue:

$ find . -type f -name '*.yml' | xargs grep -e 33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/mixed_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/upgraded_cluster/60_watcher.yml:      reason: https://github.com/elastic/elasticsearch/issues/33185

It looks like GitHub interpreted the phrase "This initial PR does not attempt to fix #33185." wrongly and closed the issue 🤦‍♀. Re-opening.

martijnvg commented 4 years ago

After merging the PR that enables the watcher rolling upgrade tests, these tests haven't failed yet. I'm going to enable these tests in the 7.x branch as well.

martijnvg commented 4 years ago

Today the first real failure occurred:

org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT > test {p0=old_cluster/60_watcher/CRUD watch APIs} FAILED
  java.lang.AssertionError: Failure at [old_cluster/60_watcher:41]: watch_record.state didn't match expected value:
  watch_record.state: expected String [executed] but was String [failed]
  at __randomizedtesting.SeedInfo.seed([885817E4E7A82A5D:C283E495447A5]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Last response:

[2020-02-03T03:01:13,678][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test] Stash dump on test failure [{
1>   "stash" : {
1>     "record_id" : "my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z",
1>     "body" : {
1>       "_id" : "my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z",
1>       "watch_record" : {
1>         "watch_id" : "my_watch",
1>         "node" : "VhrcuRKYSkmoP9P9ilKjTw",
1>         "state" : "failed",
1>         "user" : "test_user",
1>         "status" : {
1>           "state" : {
1>             "active" : true,
1>             "timestamp" : "2020-02-03T01:01:13.171Z"
1>           },
1>           "actions" : {
1>             "logging" : {
1>               "ack" : {
1>                 "timestamp" : "2020-02-03T01:01:13.171Z",
1>                 "state" : "awaits_successful_execution"
1>               }
1>             }
1>           },
1>           "execution_state" : "failed",
1>           "version" : 1
1>         },
1>         "trigger_event" : {
1>           "type" : "manual",
1>           "triggered_time" : "2020-02-03T01:01:13.475Z",
1>           "manual" : {
1>             "schedule" : {
1>               "scheduled_time" : "2020-02-03T01:01:13.475Z"
1>             }
1>           }
1>         },
1>         "input" : {
1>           "simple" : { }
1>         },
1>         "condition" : {
1>           "always" : { }
1>         },
1>         "result" : {
1>           "execution_time" : "2020-02-03T01:01:13.478Z",
1>           "execution_duration" : 1580691673483,
1>           "actions" : [ ]
1>         },
1>         "exception" : {
1>           "type" : "illegal_state_exception",
1>           "reason" : "could not register execution [my_watch]. current executions are sealed and forbid registrations of additional executions."
1>         }
1>       }
1>     }
1>   }
1> }]

Relevant build logs on the node executing the watch:

[2020-02-03T01:00:58,389][INFO ][o.e.x.w.WatcherService   ] [v6.8.7-1] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-02-03T01:01:13,312][INFO ][o.e.x.w.WatcherService   ] [v6.8.7-1] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-02-03T01:01:13,342][DEBUG][o.e.x.w.WatcherService   ] [v6.8.7-1] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-02-03T01:01:13,371][DEBUG][o.e.x.w.WatcherIndexingListener] [v6.8.7-1] adding watch [my_watch] to trigger service
[2020-02-03T01:01:13,482][INFO ][o.e.x.w.WatcherService   ] [v6.8.7-1] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-02-03T01:01:13,487][DEBUG][o.e.x.w.e.ExecutionService] [v6.8.7-1] failed to execute watch [my_watch]
java.lang.IllegalStateException: could not register execution [my_watch]. current executions are sealed and forbid registrations of additional executions.
        at org.elasticsearch.xpack.core.watcher.support.Exceptions.illegalState(Exceptions.java:26) ~[x-pack-core-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.xpack.watcher.execution.CurrentExecutions.put(CurrentExecutions.java:41) ~[x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:282) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.xpack.watcher.transport.actions.execute.TransportExecuteWatchAction$1.doRun(TransportExecuteWatchAction.java:164) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:617) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
[2020-02-03T01:01:13,506][DEBUG][o.e.x.w.WatcherService   ] [v6.8.7-1] Loaded [0] watches for execution
[2020-02-03T01:01:13,507][DEBUG][o.e.x.w.WatcherService   ] [v6.8.7-1] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-02-03T01:01:13,660][ERROR][o.e.x.w.Watcher          ] [v6.8.7-1] triggered watches could not be deleted [my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z], failure [[.triggered_watches] IndexNotFoundException[no such index]]
[2020-02-03T01:01:13,661][DEBUG][o.e.x.w.e.ExecutionService] [v6.8.7-1] finished [my_watch]/[my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z]

Build url: https://gradle-enterprise.elastic.co/s/hbiysvknha5wi/

martijnvg commented 4 years ago

I think that after the watch gets created when running against the old cluster, we should wait until the .watches index is at least yellow and watcher is started. This way we avoid executing a watch before watcher is ready to execute it.
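
For illustration, a minimal sketch of that wait using the low-level Java REST client against a locally running old (6.x) node. The class name, host, and exact endpoints here are assumptions for the example; the real change went into the test suite via #52139 and may look quite different:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch: block until the .watches index is at least yellow and the watcher
// stats endpoint responds, before triggering a manual watch execution.
public final class WaitForWatcherReady {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Wait for the .watches index to reach at least yellow health.
            Request health = new Request("GET", "/_cluster/health/.watches");
            health.addParameter("wait_for_status", "yellow");
            health.addParameter("timeout", "30s");
            client.performRequest(health);

            // Check watcher stats; in a test this would sit in a polling loop
            // until the body reports "watcher_state" : "started".
            // (/_xpack/watcher/stats is the 6.x path; 7.x uses /_watcher/stats.)
            Response stats = client.performRequest(new Request("GET", "/_xpack/watcher/stats"));
            System.out.println(stats.getStatusLine());
        }
    }
}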

martijnvg commented 4 years ago

The latest failure happened a couple of times in the last week. I've opened #52139 to address it.

martijnvg commented 4 years ago

I'm closing this issue; these commits (^) seem to have stabilised this test.