elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] WatcherRestartIT testWatcherRestart failing #79895

Closed bpintea closed 1 year ago

bpintea commented 2 years ago

Build scan: https://gradle-enterprise.elastic.co/s/umosjq57hlvzk/tests/:x-pack:qa:rolling-upgrade:v7.12.1%23oneThirdUpgradedTest/org.elasticsearch.upgrades.WatcherRestartIT/testWatcherRestart

Reproduction line: ./gradlew ':x-pack:qa:rolling-upgrade:v7.12.1#oneThirdUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.WatcherRestartIT" -Dtests.method="testWatcherRestart" -Dtests.seed=1469FD150430CE11 -Dtests.bwc=true -Dtests.locale=no-NO -Dtests.timezone=PST8PDT -Druntime.java=8

Applicable branches: 7.16

Reproduces locally?: Didn't try

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.upgrades.WatcherRestartIT&tests.test=testWatcherRestart

Failure excerpt:

org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:33261], URI [/_watcher/_stop], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update_watcher_manually_stopped) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update_watcher_manually_stopped) within 30s"},"status":503}

  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:325)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:295)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:301)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:269)
  at org.elasticsearch.upgrades.WatcherRestartIT.testWatcherRestart(WatcherRestartIT.java:36)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)

elasticmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

masseyke commented 2 years ago

I've tried several times to reproduce this with no luck.

benwtrent commented 2 years ago

Failed again: https://gradle-enterprise.elastic.co/s/fetuxpkqwegfo

trace:

org.elasticsearch.upgrades.WatcherRestartIT > testWatcherRestart FAILED
    java.lang.AssertionError: 
    Expected: not a string containing "\"watcher_state\":\"stopped\""
         but: was "{\"_nodes\":{\"total\":3,\"successful\":3,\"failed\":0},\"cluster_name\":\"v7.9.3\",\"manually_stopped\":false,\"stats\":[{\"node_id\":\"oBkeLHbySVy7kQtaxg09kQ\",\"watcher_state\":\"stopped\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":1}},{\"node_id\":\"T8VQB60TQHOpLWR6mIhB0Q\",\"watcher_state\":\"started\",\"watch_count\":0,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":0}},{\"node_id\":\"NDzwUvKOTbCQtzwBGPx6jQ\",\"watcher_state\":\"started\",\"watch_count\":1,\"execution_thread_pool\":{\"queue_size\":0,\"max_size\":1}}]}"
        at __randomizedtesting.SeedInfo.seed([D1D81A0B0B2C0594:448014844030DCCD]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.upgrades.WatcherRestartIT.lambda$ensureWatcherStarted$3(WatcherRestartIT.java:180)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1123)
        at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1096)
        at org.elasticsearch.upgrades.WatcherRestartIT.ensureWatcherStarted(WatcherRestartIT.java:174)
        at org.elasticsearch.upgrades.WatcherRestartIT.testWatcherRestart(WatcherRestartIT.java:42)

reproduce line:

./gradlew ':x-pack:qa:rolling-upgrade:v7.9.3#twoThirdsUpgradedTest' -Dtests.class="org.elasticsearch.upgrades.WatcherRestartIT" -Dtests.method="testWatcherRestart" -Dtests.seed=D1D81A0B0B2C0594 -Dtests.bwc=true -Dtests.locale=nl-BE -Dtests.timezone=Etc/GMT-8 -Druntime.java=8

Attached are the cluster logs from the test failure

120.zip
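
For context, the Hamcrest message in the trace above shows the failing check is a substring match on the raw Watcher stats response, retried inside `assertBusy`: it only passes once no node reports `"watcher_state":"stopped"`. A minimal standalone sketch of that kind of polling check follows. The class `WatcherStateCheck`, its method names, and the fixed poll interval are assumptions made for illustration; this is not the code in `WatcherRestartIT`, and the real `assertBusy` helper may time its retries differently.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of the Elasticsearch test suite.
final class WatcherStateCheck {

    // Substring the failing assertion looked for in the raw stats response.
    private static final String STOPPED = "\"watcher_state\":\"stopped\"";

    /** Returns true if any node in the raw stats JSON reports watcher_state "stopped". */
    public static boolean anyNodeStopped(String statsJson) {
        return statsJson.contains(STOPPED);
    }

    /**
     * Polls the supplier until no node reports "stopped" or the timeout elapses.
     * Returns true if the condition was met in time, false otherwise.
     */
    public static boolean waitUntilNoNodeStopped(Supplier<String> statsSupplier,
                                                 long timeoutMillis, long pollMillis) {
        long start = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (true) {
            if (!anyNodeStopped(statsSupplier.get())) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up early.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Shaped like the response in the trace: one node stopped, one started.
        String oneStopped = "{\"stats\":[{\"node_id\":\"n1\",\"watcher_state\":\"stopped\"},"
                + "{\"node_id\":\"n2\",\"watcher_state\":\"started\"}]}";
        String allStarted = "{\"stats\":[{\"node_id\":\"n1\",\"watcher_state\":\"started\"},"
                + "{\"node_id\":\"n2\",\"watcher_state\":\"started\"}]}";

        System.out.println(anyNodeStopped(oneStopped));
        System.out.println(waitUntilNoNodeStopped(() -> allStarted, 1000, 50));
    }
}
```

With a response like the one in the trace, where one of three nodes stays `stopped`, a check of this shape keeps failing until the timeout, which matches the failure seen here.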

dimitris-athanasiou commented 2 years ago

Failed again, in the same way that Ben described above: https://gradle-enterprise.elastic.co/s/ylll4e7wtprmm

Cluster logs attached as well. 161.zip

martijnvg commented 2 years ago

This test has been muted on the 7.16 and 7.17 branches, where it was failing often enough to be disruptive.

jakelandis commented 2 years ago

I'm fairly confident that the most recent failures are due to the same root cause as https://github.com/elastic/elasticsearch/issues/81110#issuecomment-1002234837, as evidenced by the log message "missing watcher index templates, not starting watcher service" and by the step at which the failures happen. I would suggest unmuting this test once that issue is resolved.

(also the OP error looks transient and not too concerning)

gmarouli commented 1 year ago

Since this is related to versions < 7.16, it no longer seems relevant. The test on main is currently not muted, and no failures have been reported recently.