romseygeek opened 4 years ago
Pinging @elastic/es-core-features (:Core/Features/Watcher)
Increased timeout on for yellow state on https://github.com/elastic/elasticsearch/pull/48434 (will backport) and will look into the assertion error : https://gradle-enterprise.elastic.co/s/kcjrg2hoa7zqe/
another: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+bwc/1837/console
I think the last failures still had the 60s timeout in place.
This test has been re-muted across branches and the timeout reduced back to the original 30s.
This test hasn't failed in the master branch since it was enabled on December 23rd (as of January 6th).
I'm going to enable it on the 7.x branch too and if it doesn't fail for a while then I will close this issue.
This test failed twice yesterday, on the master and 7.x branches. I will mute the test now and investigate these failures with the additional logging that was added recently.
The failures relate to waiting for an at-least-yellow cluster state. While the cluster state itself looks healthy enough to report green, this particular test waits for specific indices to be at least yellow, and one index (bwc_watch_index) simply doesn't exist. This index should be created by a watcher index action, but apparently that action never executes, which results in bwc_watch_index not being created and the health API returning a red status.
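For reference, the cluster health body dumped in the later failures shows exactly this pattern: every shard that does exist is fully assigned, yet the status is red and the request timed out, because one of the requested indices was never created. A small Python sketch parsing the JSON body quoted verbatim from one of those failures:

```python
import json

# Health response body dumped by one of the failed runs (copied verbatim
# from the build log later in this issue).
health_body = (
    '{"cluster_name":"v7.0.1","status":"red","timed_out":true,'
    '"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,'
    '"active_shards":4,"relocating_shards":0,"initializing_shards":0,'
    '"unassigned_shards":0,"delayed_unassigned_shards":0,'
    '"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,'
    '"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}'
)

health = json.loads(health_body)

# All shards that do exist are fully assigned...
print(health["active_shards_percent_as_number"])  # 100.0
# ...yet the status is red and the wait timed out, because one of the
# indices named in the health request (bwc_watch_index) does not exist.
print(health["status"], health["timed_out"])  # red True
```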
I unmuted the test; when it fails again, it will now capture watcher stats as well.
The watcher debug logs don't indicate that any watch executes, which would explain the index action not running. I think the watch may be stuck, and hopefully the watcher stats will capture this.
This test hasn't failed for almost a month, so I will close this issue. If it fails again, a new issue should be opened.
@martijnvg I just saw that this test is still muted on 7.x. Should your unmute commit 9d7b80f be backported, or was that done on purpose?
@cbuescher I think I forgot the backport... I will backport the commit you mention today.
Reopening the issue, as we had another failure on master today:
Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/176/console Build Scans: https://gradle-enterprise.elastic.co/s/humaqzeduckx4
REPRODUCE WITH: ./gradlew ':x-pack:qa:full-cluster-restart:v8.0.0#upgradedClusterTest' --tests "org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher" \ -Dtests.seed=7EF337E5D8E436CC \ -Dtests.security.manager=true \ -Dtests.locale=ar-KW \ -Dtests.timezone=US/Central \ -Druntime.java=11
java.lang.AssertionError:
Expected: <2>
     but: was <3>
	at __randomizedtesting.SeedInfo.seed([7EF337E5D8E436CC:131EC4282885A13A]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:418)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:207)
The failure doesn't reproduce locally.
The logging that was added to help debug this issue is still there today - this line: https://github.com/elastic/elasticsearch/blob/35460a5f8a14bf19540224159be90696497a994d/x-pack/qa/full-cluster-restart/build.gradle#L60
That line was added over a year ago in 106c3ce686b706535686733e1354e51eec619cab. It means that the server-side logs for the X-Pack full cluster restart tests are 99% watcher debug, making it hard to debug anything else. Is this extra debug level logging still required today?
@droberts195 I will remove that line. If needed, we can enable watcher debug logging later when we get back to investigating this test failure (and not forget to remove it afterwards).
A different instance of this test failing: https://gradle-enterprise.elastic.co/s/alpg4ojedmfhg
org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED
org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:44101/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"v7.0.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
at __randomizedtesting.SeedInfo.seed([7E0AB41A5FE5E699:13E747D7AF84716F]:0)
at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)
at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:757)
at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:159)
Started happening again (failed 5 times over the last week) https://gradle-enterprise.elastic.co/s/4nqy4ic7pc2cc/tests/:x-pack:qa:full-cluster-restart:v7.3.1%23oldClusterTest/org.elasticsearch.xpack.restart.FullClusterRestartIT/testWatcher?top-execution=1
org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:38531], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
Another failure, again due to timeout:
https://gradle-enterprise.elastic.co/s/o6w6vll7qjego
org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED
org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:37476/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"v7.3.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
at __randomizedtesting.SeedInfo.seed([4A1898BDB88606D4:27F56B7048E79122]:0)
at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288)
at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:752)
at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:158)
Another failure https://gradle-enterprise.elastic.co/s/htn2dtr4ajl5c
This one seems to be a genuine test failure:
java.lang.AssertionError:
Expected: <2>
     but: was <3>
	at __randomizedtesting.SeedInfo.seed([8299F6568D3BC8E1:EF74059B7D5A5F17]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:722)
	at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:200)
And another https://gradle-enterprise.elastic.co/s/snmnlk6itki4w, timeout this time
org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:33149], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"elasticsearch","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
Looks like this fits here: https://gradle-enterprise.elastic.co/s/tb7rzfmtfukac
Pinging @elastic/es-data-management (Team:Data Management)
I think that the version is getting incremented because watcher sometimes happens to run, and it updates the document in .watches with status fields. By default it does this every 500ms (xpack.watcher.trigger.schedule.ticker.tick_interval), so it seems believable that every once in a while this happens while the test is running. I'm going to change the test to assert that the version is at least 2.
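As an illustrative sketch of that change (the actual test is Java/Hamcrest; the function name here is hypothetical), the relaxed check just needs to tolerate an occasional extra status write:

```python
def assert_watch_version(doc_version: int) -> None:
    """Illustrative stand-in for the assertion in assertBasicWatchInteractions.

    Before: the test asserted the version was exactly 2, which breaks
    whenever the 500ms ticker fires mid-test and writes status fields
    into the .watches document, bumping the version to 3.
    After: assert the version is at least 2, tolerating extra writes.
    """
    assert doc_version >= 2, f"expected version >= 2, got {doc_version}"

assert_watch_version(2)  # the expected case
assert_watch_version(3)  # also fine: the ticker happened to update the watch
```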
The timeout failure is more interesting though. From the latest failure, it appears to be happening because bwc_watch_index does not exist, so the health check fails (the other two indices exist, according to the cluster state that's dumped out on failure). I have no idea why bwc_watch_index would not exist though. I can see that the bwc_watch was created. And according to the watcher stats that are dumped out on failure, watcher is up and running and has run 29 watches (which probably maps to running this watch once per second for the 29 times it could before the 30-second timeout). There is no mention of bwc_watch failing, and no mention of bwc_watch_index being created or failing to be created. I haven't been able to reproduce this locally.
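The waitForYellow helper in the test boils down to a single cluster health call over the three index names, with wait_for_status=yellow and a 30s timeout. A simplified Python model of why a single missing index turns that call into an HTTP 408 (hypothetical names; the real code is the Java method FullClusterRestartIT.waitForYellow, and real index-level health is more nuanced than this):

```python
# Simplified model of the health check the test performs. Assumption:
# we only distinguish "a requested index is missing" (red) from
# "everything requested exists and is assigned" (at least yellow).
STATUS_ORDER = {"red": 0, "yellow": 1, "green": 2}

def health_status(requested_indices, existing_indices):
    # The health API computes status over the *requested* indices; an
    # index that does not exist at all makes the result red, no matter
    # how healthy the shards of the existing indices are.
    if any(ix not in existing_indices for ix in requested_indices):
        return "red"
    return "yellow"

def wait_for_yellow(requested, existing):
    status = health_status(requested, existing)
    # With wait_for_status=yellow and timeout=30s, a persistently red
    # status comes back as HTTP 408 with "timed_out": true.
    return 200 if STATUS_ORDER[status] >= STATUS_ORDER["yellow"] else 408

requested = [".watches", "bwc_watch_index", ".watcher-history-1"]
print(wait_for_yellow(requested, {".watches", ".watcher-history-1"}))  # 408
print(wait_for_yellow(requested, set(requested)))  # 200
```

This matches the failure mode above: the dumped cluster state shows 100% of the existing shards assigned, yet the request still times out red because bwc_watch_index was never created.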
I realized I hadn't been paying attention when I said that the watch had run 29 times -- 29 was the threadpool max size. In the run that failed there were only 2 watches in the watcher stats, while the test creates 3 watches. I dumped out the watcher stats in a successful run, and one node showed 2 watches and the other 1 watch (as you'd expect). So it appears that in this timeout run the bwc_watch somehow silently failed to be created.
This has been failing a couple of times a day since it was re-enabled in #48000
Failures fall into two types: on 7.x and 7.5:
eg https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191023094703-5F65A9C2&_g=()
and on master:
eg https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191022054329-BF6E5EA6&_g=()