Closed nik9000 closed 4 years ago
Pinging @elastic/es-core-infra
Pasting from an old issue (in old xpack repo) with a similar error. I think it was a concurrency issue for the other one but was never resolved.
I tried a few approaches to fix this but wasn't satisfied with any of them. Jotting down notes for now:
The issue is that ExecutionService#clearExecutions() is called on stop or pause; it seals currentExecutions and then overwrites it with a new instance. A manual execution of a watch can slip in between the sealing and the replacement, which throws the exception seen here.
There are a few ways this could be fixed, unsure the best route:
- Instead of throwing an exception, CurrentExecutions#put() could return an Optional to signal whether it was able to add to the map; if not, the calling code could just log a message and abort
- Allow manually executed watches to be added to the currently-executing list despite it being sealed
- Modify CurrentExecutions so that it can be reset internally, rather than externally overwritten after being drained. This would change up the sealing dynamics and allow waiting for the drain to finish, etc.
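A minimal sketch of the first option — put() signaling failure instead of throwing, so the caller can log and abort. Names here are illustrative, not the actual CurrentExecutions class:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sealable execution registry whose put() returns a boolean
// instead of throwing, so the caller can log a message and abort.
// This is a sketch of the idea, not the real CurrentExecutions class.
class SealableExecutions<T> {
    private final Map<String, T> executions = new ConcurrentHashMap<>();
    private final AtomicBoolean sealed = new AtomicBoolean(false);

    /** @return true if registered, false if the registry is sealed */
    boolean put(String watchId, T execution) {
        if (sealed.get()) {
            return false; // sealed: caller logs and aborts, no exception
        }
        executions.put(watchId, execution);
        // re-check: seal() may have raced with us between the check and
        // the put; undo the registration if so
        if (sealed.get()) {
            executions.remove(watchId);
            return false;
        }
        return true;
    }

    void seal() {
        sealed.set(true);
    }
}
```

The calling code in ExecutionService would then abort the manual execution with a log line rather than recording a failed watch record.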
In that issue, @DaveCTurner followed up with:
Would it work to do something like the following?
diff --git a/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java b/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
index 407ab9a7e5..c3f73740e8 100644
--- a/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
+++ b/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/execution/ExecutionService.java
@@ -554,8 +554,9 @@ public class ExecutionService extends AbstractComponent {
* This is needed, because when this method is called, watcher keeps running, so sealing executions would be a bad idea
*/
public void clearExecutions() {
- currentExecutions.sealAndAwaitEmpty(maxStopTimeout);
+ final CurrentExecutions previousExecutions = currentExecutions;
currentExecutions = new CurrentExecutions();
+ previousExecutions.sealAndAwaitEmpty(maxStopTimeout);
}
// the watch execution task takes another runnable as parameter
NB I'm not sure this actually works as-is without further synchronisation.
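To make the swap in the diff above atomic, one option is an AtomicReference; a hedged sketch with stand-in names (CurrentMap is not the real CurrentExecutions). Note it still leaves the gap mentioned above: a thread that read the old reference before the swap can attempt to register on the map that is about to be sealed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for CurrentExecutions; sealAndAwaitEmpty() here only flips a
// flag, whereas the real method also waits for executions to drain.
class CurrentMap {
    final Map<String, Object> entries = new ConcurrentHashMap<>();
    volatile boolean sealed = false;

    void sealAndAwaitEmpty() {
        sealed = true;
    }
}

class ExecutionsHolder {
    private final AtomicReference<CurrentMap> current =
            new AtomicReference<>(new CurrentMap());

    void clearExecutions() {
        // swap first so new registrations land in the fresh map...
        CurrentMap previous = current.getAndSet(new CurrentMap());
        // ...then seal and drain only the old one
        previous.sealAndAwaitEmpty();
    }

    CurrentMap current() {
        return current.get();
    }
}
```

The swap itself is now atomic, but fully closing the race would still need registration and swap to coordinate (e.g. the re-check-after-put pattern, or a lock).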
Not sure if it's the same issue (manual executions racing), but if there are concurrency issues with this code it might be manifesting in the above test too.
And another one today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2197/consoleText in master intake
Another instance in 7.0 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+artifactory/133/console
Another failure in 7.0 https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+artifactory/136/testReport/junit/org.elasticsearch.upgrades/UpgradeClusterClientYamlTestSuiteIT/test__p0_old_cluster_60_watcher_CRUD_watch_APIs_/ so I backported the muting to 7.0 1948702f4ab3a7a2bfc243ca0328173cac4a487c
There are only 2 watcher tests in the yml suite and both are muted, leaving only WatcherRestartIT for Watcher upgrade test coverage, and that test appears pretty minimal.
Un-muted this test on PR #42377 to obtain additional logs.
If (when?) this test fails again please obtain the following information before muting the test:
This failed in a PR build:
Jenkins build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-bwc/6473/
Gradle scan: https://gradle.com/s/7smiqhxdoghp2
Repro:
./gradlew :x-pack:qa:rolling-upgrade:v7.3.0#oldClusterTestRunner --tests "org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT.test {p0=old_cluster/60_watcher/CRUD watch APIs}" -Dtests.seed=8DFA7D80E04774A7 -Dtests.security.manager=true -Dtests.locale=ta-MY -Dtests.timezone=Asia/Krasnoyarsk -Dcompiler.java=12 -Druntime.java=11 -Dtests.rest.suite=old_cluster
Logs:
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/hprof-files.tar.gz
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/syserr-files.tar.gz
https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+pull-request-bwc/6473/log-files.tar.gz
In today's master
I found a bunch of tests that are being skipped and which link to this issue:
$ find . -type f -name '*.yml' | xargs grep -e 33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build/resources/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/build-idea/classes/test/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/old_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/mixed_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
./x-pack/qa/rolling-upgrade/src/test/resources/rest-api-spec/test/upgraded_cluster/60_watcher.yml: reason: https://github.com/elastic/elasticsearch/issues/33185
It looks like GitHub interpreted the phrase "This initial PR does not attempt to fix #33185." wrongly 🤦♀ . Re-opening.
After merging the PR that enables the watcher rolling upgrade tests, these tests haven't failed yet. I'm going to also enable them in the 7.x branch.
Today the first real failure occurred:
org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT > test {p0=old_cluster/60_watcher/CRUD watch APIs} FAILED
java.lang.AssertionError: Failure at [old_cluster/60_watcher:41]: watch_record.state didn't match expected value:
watch_record.state: expected String [executed] but was String [failed]
at __randomizedtesting.SeedInfo.seed([885817E4E7A82A5D:C283E495447A5]:0)
at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:405)
at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:382)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Last response:
[2020-02-03T03:01:13,678][INFO ][o.e.u.UpgradeClusterClientYamlTestSuiteIT] [test] Stash dump on test failure [{
  "stash" : {
    "record_id" : "my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z",
    "body" : {
      "_id" : "my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z",
      "watch_record" : {
        "watch_id" : "my_watch",
        "node" : "VhrcuRKYSkmoP9P9ilKjTw",
        "state" : "failed",
        "user" : "test_user",
        "status" : {
          "state" : {
            "active" : true,
            "timestamp" : "2020-02-03T01:01:13.171Z"
          },
          "actions" : {
            "logging" : {
              "ack" : {
                "timestamp" : "2020-02-03T01:01:13.171Z",
                "state" : "awaits_successful_execution"
              }
            }
          },
          "execution_state" : "failed",
          "version" : 1
        },
        "trigger_event" : {
          "type" : "manual",
          "triggered_time" : "2020-02-03T01:01:13.475Z",
          "manual" : {
            "schedule" : {
              "scheduled_time" : "2020-02-03T01:01:13.475Z"
            }
          }
        },
        "input" : {
          "simple" : { }
        },
        "condition" : {
          "always" : { }
        },
        "result" : {
          "execution_time" : "2020-02-03T01:01:13.478Z",
          "execution_duration" : 1580691673483,
          "actions" : [ ]
        },
        "exception" : {
          "type" : "illegal_state_exception",
          "reason" : "could not register execution [my_watch]. current executions are sealed and forbid registrations of additional executions."
        }
      }
    }
  }
}]
Relevant build logs on node executing watch:
[2020-02-03T01:00:58,389][INFO ][o.e.x.w.WatcherService ] [v6.8.7-1] stopping watch service, reason [watcher manually marked to shutdown by cluster state update]
[2020-02-03T01:01:13,312][INFO ][o.e.x.w.WatcherService ] [v6.8.7-1] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-02-03T01:01:13,342][DEBUG][o.e.x.w.WatcherService ] [v6.8.7-1] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-02-03T01:01:13,371][DEBUG][o.e.x.w.WatcherIndexingListener] [v6.8.7-1] adding watch [my_watch] to trigger service
[2020-02-03T01:01:13,482][INFO ][o.e.x.w.WatcherService ] [v6.8.7-1] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2020-02-03T01:01:13,487][DEBUG][o.e.x.w.e.ExecutionService] [v6.8.7-1] failed to execute watch [my_watch]
java.lang.IllegalStateException: could not register execution [my_watch]. current executions are sealed and forbid registrations of additional executions.
at org.elasticsearch.xpack.core.watcher.support.Exceptions.illegalState(Exceptions.java:26) ~[x-pack-core-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.xpack.watcher.execution.CurrentExecutions.put(CurrentExecutions.java:41) ~[x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:282) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.xpack.watcher.transport.actions.execute.TransportExecuteWatchAction$1.doRun(TransportExecuteWatchAction.java:164) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.xpack.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:617) [x-pack-watcher-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.7-SNAPSHOT.jar:6.8.7-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
[2020-02-03T01:01:13,506][DEBUG][o.e.x.w.WatcherService ] [v6.8.7-1] Loaded [0] watches for execution
[2020-02-03T01:01:13,507][DEBUG][o.e.x.w.WatcherService ] [v6.8.7-1] watch service has been reloaded, reason [new local watcher shard allocation ids]
[2020-02-03T01:01:13,660][ERROR][o.e.x.w.Watcher ] [v6.8.7-1] triggered watches could not be deleted [my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z], failure [[.triggered_watches] IndexNotFoundException[no such index]]
[2020-02-03T01:01:13,661][DEBUG][o.e.x.w.e.ExecutionService] [v6.8.7-1] finished [my_watch]/[my_watch_26c48e21-ab7e-4953-836a-f90d7ec7baea-2020-02-03T01:01:13.478Z]
Build url: https://gradle-enterprise.elastic.co/s/hbiysvknha5wi/
I think that after the watch is created when running against the old cluster, we should wait until the .watches index is at least yellow and watcher has started. That way we avoid executing a watch before watcher is ready to execute it.
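The "wait until ready" idea boils down to polling a readiness condition with a timeout before executing the watch. A generic sketch of such a helper (the condition itself — e.g. checking cluster health for .watches and watcher stats — would be supplied by the test):

```java
import java.util.function.BooleanSupplier;

// Illustrative polling helper for the wait-until-ready idea: repeatedly
// evaluate a caller-supplied condition (such as ".watches index is at
// least yellow and watcher is started") until it holds or a timeout
// elapses. Not taken from the Elasticsearch test framework.
final class WaitUntil {
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(100); // back off briefly between checks
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }
}
```

In the yml suites the equivalent is a cluster-health wait plus an assert-busy-style loop on the watcher stats state before issuing the execute-watch call.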
The latest failure happened a couple of times in the last week. I've opened #52139 to address it.
I'm closing this issue; these commits (^) seem to have stabilised this test.
This rolling upgrade build failed fairly mysteriously. This is what the failure looks like:
This is in the "old" cluster so the cluster state is empty, but we still get an error as though the watch was running.