Closed: andyb-elastic closed this issue 4 years ago
Pinging @elastic/es-core-infra
Actually, this does look odd in the integ cluster logs, though it happens after the failed request that fails the test:
[2018-07-23T22:10:15,863][ERROR][o.e.s.m.MustacheScriptEngine] Error running com.github.mustachejava.codes.DefaultMustache@56a43cff
com.github.mustachejava.MustacheException: Failed to get value for ctx @[query-template:1]
at com.github.mustachejava.codes.ValueCode.execute(ValueCode.java:74) ~[compiler-0.9.3.jar:?]
at com.github.mustachejava.codes.DefaultMustache.run(DefaultMustache.java:34) ~[compiler-0.9.3.jar:?]
at com.github.mustachejava.codes.DefaultCode.execute(DefaultCode.java:162) ~[compiler-0.9.3.jar:?]
at com.github.mustachejava.codes.DefaultMustache.execute(DefaultMustache.java:55) ~[compiler-0.9.3.jar:?]
at com.github.mustachejava.Mustache.execute(Mustache.java:38) ~[compiler-0.9.3.jar:?]
at org.elasticsearch.script.mustache.MustacheScriptEngine$MustacheExecutableScript.lambda$execute$0(MustacheScriptEngine.java:111) ~[lang-mustache-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method) ~[?:?]
at org.elasticsearch.script.mustache.MustacheScriptEngine$MustacheExecutableScript.execute(MustacheScriptEngine.java:110) [lang-mustache-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.watcher.common.text.TextTemplateEngine.render(TextTemplateEngine.java:57) [x-pack-watcher-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.watcher.actions.logging.ExecutableLoggingAction.execute(ExecutableLoggingAction.java:46) [x-pack-watcher-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.core.watcher.actions.ActionWrapper.execute(ActionWrapper.java:144) [x-pack-core-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.watcher.execution.ExecutionService.executeInner(ExecutionService.java:464) [x-pack-watcher-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.watcher.execution.ExecutionService.execute(ExecutionService.java:295) [x-pack-watcher-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.xpack.watcher.transport.actions.execute.TransportExecuteWatchAction$1.doRun(TransportExecuteWatchAction.java:154) [x-pack-watcher-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
Caused by: java.lang.IllegalArgumentException: Iterable object is self-referencing itself
at org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:253) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:256) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.common.util.CollectionUtils.ensureNoSelfReferences(CollectionUtils.java:231) ~[elasticsearch-6.4.0-SNAPSHOT.jar:6.4.0-SNAPSHOT]
at org.elasticsearch.script.mustache.CustomReflectionObjectHandler.stringify(CustomReflectionObjectHandler.java:160) ~[?:?]
at com.github.mustachejava.codes.ValueCode.execute(ValueCode.java:69) ~[?:?]
... 20 more
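For context, the "Iterable object is self-referencing itself" cause means the map handed to the template (the ctx value here) contains a reference back to itself somewhere in its object graph, and the rendering path rejects that before walking it. A minimal, JDK-only sketch of the kind of structure that trips such a check (illustration only, not Elasticsearch code):

```java
import java.util.HashMap;
import java.util.Map;

public class SelfReferenceDemo {
    public static void main(String[] args) {
        // Build a map whose object graph loops back on itself.
        Map<String, Object> ctx = new HashMap<>();
        Map<String, Object> payload = new HashMap<>();
        ctx.put("payload", payload);
        payload.put("ctx", ctx); // the cycle: ctx -> payload -> ctx

        // A naive depth-first walk over this structure never terminates, which is
        // why the rendering path checks for self references up front and fails fast.
        // (ctx.toString() would recurse through the cycle and overflow the stack.)
        System.out.println(ctx.containsKey("payload"));
    }
}
```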
Good catch @andyb-elastic.
I have enabled x-pack/qa/smoke-test-watcher again.
I have covered the issues that I know about, but as a comment above notes, "This test has failed a handful of times in the last three months but for different reasons." I don't have logs for all of those, so it's possible that some issues still exist. Please re-mute this test if it turns out to still be flaky.
Another failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2183/console
./gradlew :x-pack:qa:smoke-test-watcher:integTestRunner -Dtests.seed=921232DB4961CB4E -Dtests.class=org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT -Dtests.method="testMonitorClusterHealth" -Dtests.security.manager=true -Dtests.locale=en-NZ -Dtests.timezone=Europe/Warsaw -Dcompiler.java=11 -Druntime.java=8
12:06:36 FAILURE 2.58s | SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth <<< FAILURES!
12:06:36 > Throwable #1: java.lang.AssertionError: Watch count (from _watcher/stats)
12:06:36 > Expected: is <0>
12:06:36 > but: was <1>
12:06:36 > at __randomizedtesting.SeedInfo.seed([921232DB4961CB4E:61CAABBBBA00A4C4]:0)
12:06:36 > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:06:36 > at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.assertWatchCount(SmokeTestWatcherTestSuiteIT.java:230)
12:06:36 > at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth(SmokeTestWatcherTestSuiteIT.java:162)
12:06:36 > at java.lang.Thread.run(Thread.java:748)
12:06:36 Completed [3/3] in 2.61s, 1 test, 1 failure <<< FAILURES!
Thanks, I'll look today and re-mute if needed.
By piecing together the output from the test with the cluster log, I have a theory about what's happening here:
#TEST# [2019-02-21T19:06:35,018][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] Found [0] hits in watcher history
#NODE# [2019-02-21T18:06:35,070][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [node-0] triggered job [cluster_health_watch] at [2019-02-21T18:06:35.070Z] (scheduled time was [2019-02-21T18:06:34.919Z])
#NODE# [2019-02-21T18:06:35,081][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [.triggered_watches] creating index, cause [auto(bulk api)], templates [.triggered_watches], shards [1]/[1], mappings [doc]
#NODE# [2019-02-21T18:06:35,083][INFO ][o.e.c.r.a.AllocationService] [node-0] updating number_of_replicas to [0] for indices [.triggered_watches]
#NODE# [2019-02-21T18:06:35,155][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.triggered_watches][0]] ...]).
#NODE# [2019-02-21T18:06:35,184][DEBUG][o.e.x.w.e.ExecutionService] [node-0] executing watch [cluster_health_watch]
#NODE# [2019-02-21T18:06:35,212][INFO ][o.e.x.w.a.l.ExecutableLoggingAction] [node-0] executed
#NODE# [2019-02-21T18:06:35,215][DEBUG][o.e.x.w.WatcherIndexingListener] [node-0] adding watch [cluster_health_watch] to trigger service
#NODE# [2019-02-21T18:06:35,231][INFO ][o.e.c.m.MetaDataCreateIndexService] [node-0] [.watcher-history-9-2019.02.21] creating index, cause [auto(bulk api)], templates [.watch-history-9], shards [1]/[0], mappings [doc]
#NODE# [2019-02-21T18:06:35,299][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.watcher-history-9-2019.02.21][0]] ...])
#NODE# [2019-02-21T18:06:35,327][INFO ][o.e.c.m.MetaDataMappingService] [node-0] [.watcher-history-9-2019.02.21/KRXX-Mz_QzeQXiarK-39yw] update_mapping [doc]
#NODE# [2019-02-21T18:06:35,373][DEBUG][o.e.x.w.e.ExecutionService] [node-0] finished [cluster_health_watch]/[cluster_health_watch_53e3c983-2cc4-48e8-85e2-e27e15af89b9-2019-02-21T18:06:35.072183Z]
#TEST# [2019-02-21T19:06:36,043][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] Refreshing watcher history
#TEST# [2019-02-21T19:06:36,079][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] Searching watcher history
#NODE# [2019-02-21T18:06:36,078][DEBUG][o.e.x.w.t.s.e.TickerScheduleTriggerEngine] [node-0] triggered job [cluster_health_watch] at [2019-02-21T18:06:36.077Z] (scheduled time was [2019-02-21T18:06:35.919Z])
#NODE# [2019-02-21T18:06:36,089][DEBUG][o.e.x.w.e.ExecutionService] [node-0] executing watch [cluster_health_watch]
#TEST# [2019-02-21T19:06:36,093][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] Found [1] hits in watcher history
#TEST# [2019-02-21T19:06:36,093][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] Watch hit 0 has id [cluster_health_watch] (expecting [cluster_health_watch])
#NODE# [2019-02-21T18:06:36,098][DEBUG][o.e.x.w.WatcherIndexingListener] [node-0] adding watch [cluster_health_watch] to trigger service
#NODE# [2019-02-21T18:06:36,103][DEBUG][o.e.x.w.e.ExecutionService] [node-0] failed to execute watch [cluster_health_watch]
org.elasticsearch.index.engine.VersionConflictEngineException: [doc][cluster_health_watch]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [2] and primary term [1]
#TEST# [2019-02-21T19:06:36,306][INFO ][o.e.s.SmokeTestWatcherTestSuiteIT] [testMonitorClusterHealth] after test
#TEST Watch count (from _watcher/stats) Expected: is <0> but: was <1>
It seems there was a concurrency issue between deleting the watch (somewhere between 19:06:36,093 and 19:06:36,306 in the test) and an execution of the watch being written to the .watches index (around 18:06:36,098 on the node).
The watcher stats are driven from the TriggerService, and the WatcherIndexingListener updates the TriggerService as part of its preIndex method. So deleting the watch removes it from the TriggerService, which removes it from the stats count, but then the watch finishes executing and tries to update the .watches index, and that causes it to be added back to the TriggerService. The write to the .watches index fails because of a version conflict (from the delete), and the postIndex listener is supposed to detect that and clean up the TriggerService stats. It probably would, except the test has already failed by then, so the cluster shuts down.
I think an assertBusy around the assertWatchCount will do the trick (see the sketch below), but it's a bit ugly. I don't have time to fix it today, but will try to get a PR up next week.
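For illustration, a minimal sketch of that idea, assuming assertWatchCount is the existing assertion helper in SmokeTestWatcherTestSuiteIT and that assertBusy is inherited from ESTestCase; the timeout value is an arbitrary choice for this sketch, not a recommendation:

```java
import java.util.concurrent.TimeUnit;

// Inside SmokeTestWatcherTestSuiteIT; the "..." stands for the unchanged test body
// (store cluster_health_watch, wait for executions, check history, delete the watch).
public void testMonitorClusterHealth() throws Exception {
    // ...

    // Retry the stats assertion instead of asserting once, so the postIndex
    // listener has time to remove the re-added watch from the TriggerService.
    assertBusy(() -> assertWatchCount(0), 30, TimeUnit.SECONDS);
}
```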
Failed again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+periodic/800
It does not reproduce with:
./gradlew :x-pack:qa:smoke-test-watcher:integTestRunner -Dtests.seed=37D3466CC78964DE -Dtests.class=org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT -Dtests.method="testMonitorClusterHealth" -Dtests.security.manager=true -Dtests.locale=ro -Dtests.timezone=PST -Dcompiler.java=11 -Druntime.java=8
I muted the test again in 69fe79a85ea27ad6228ecde02d54cad83b462f6d
It just failed again with the above commit: https://gradle-enterprise.elastic.co/s/4665fyvjoklhy
I'm going to mute the test for now and look into the logs that have been produced.
Failure with NPE, but I think this might be from the test itself
https://gradle-enterprise.elastic.co/s/mjpliukhzhawm
./gradlew ':x-pack:qa:smoke-test-watcher:integTestRunner' --tests "org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth" \
-Dtests.seed=849FC7F5DD295954 \
-Dtests.security.manager=true \
-Dtests.locale=th-TH \
-Dtests.timezone=Asia/Nicosia \
-Dcompiler.java=13
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([849FC7F5DD295954:77475E952E4836DE]:0)
at org.elasticsearch.smoketest.SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth(SmokeTestWatcherTestSuiteIT.java:159)
Thanks @polyfractal. This test failed a few times. I will mute it and investigate the logs.
The NPE is caused by the fact that history docs were returned with a state of not_executed_already_queued. Those docs have no result json object, so retrieving hits.hits.0._source.result.condition.met returns null. The test then tries to cast that to a boolean, which results in an NPE. The history doc that does contain a result object is returned, but not as the first hit. I will fix this shortly.
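As an illustration of that fix direction (skip hits that have no result object instead of always reading the first hit), here is a minimal, hypothetical sketch over the parsed search response as plain JDK maps and lists; the helper name and structure are made up for this example and are not the test's actual code:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: given the parsed _search response (as nested Maps/Lists),
// return condition.met from the first hit that actually carries a "result" object,
// instead of blindly reading hits.hits.0._source.result.condition.met.
@SuppressWarnings("unchecked")
static Boolean firstConditionMet(Map<String, Object> searchResponse) {
    Map<String, Object> hits = (Map<String, Object>) searchResponse.get("hits");
    List<Map<String, Object>> hitList = (List<Map<String, Object>>) hits.get("hits");
    for (Map<String, Object> hit : hitList) {
        Map<String, Object> source = (Map<String, Object>) hit.get("_source");
        Map<String, Object> result = (Map<String, Object>) source.get("result");
        if (result == null) {
            continue; // e.g. state not_executed_already_queued: no result object yet
        }
        Map<String, Object> condition = (Map<String, Object>) result.get("condition");
        return (Boolean) condition.get("met");
    }
    return null; // no executed history doc in this page of hits
}
```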
This test has failed a few times over the weekend. I will mute this test and analyse the failures.
I think #51466 has fixed this issue. If this test fails again, a new issue should be opened, possibly referencing this one if the failure looks similar.
Doesn't reproduce, looks like the test cluster had a problem but there's nothing too interesting in the logs. This test has failed a handful of times in the last three months but for different reasons. Not sure there's anything to do here.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=centos/1199/console
CI logs: build-1199-watcher.txt
Integ cluster logs: run.log