elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] FullClusterRestartIT#testWatcher failures #48381

Open romseygeek opened 4 years ago

romseygeek commented 4 years ago

This has been failing a couple of times a day since it was re-enabled in #48000

Failures fall into two types. On 7.x and 7.5:

org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:35174], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"v6.7.0","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
    at __randomizedtesting.SeedInfo.seed([1CA518BE73754281:7148EB738314D577]:0)
    at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:283)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:261)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:567)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:160)

e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191023094703-5F65A9C2&_g=()

and on master:

java.lang.AssertionError: 
Expected: <2>
     but: was <3>
    at __randomizedtesting.SeedInfo.seed([16B952622AA67F39:7B54A1AFDAC7E8CF]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.junit.Assert.assertThat(Assert.java:923)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:351)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:176)

e.g. https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191022054329-BF6E5EA6&_g=()

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/Watcher)

jakelandis commented 4 years ago

Increased the timeout for yellow state in https://github.com/elastic/elasticsearch/pull/48434 (will backport) and will look into the assertion error: https://gradle-enterprise.elastic.co/s/kcjrg2hoa7zqe/

pugnascotia commented 4 years ago

Another failure: https://gradle-enterprise.elastic.co/s/rvzlp65switqy/tests/qapgjxqnlfyjk-juyey3tkpelm6

tlrx commented 4 years ago

Another failure: https://gradle-enterprise.elastic.co/s/e6ambkakrlg3y/tests/onumgmisjues4-juyey3tkpelm6?openStackTraces=WzBd

hendrikmuhs commented 4 years ago

another: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob+fast+bwc/1837/console

I think the last failures already had the 60s timeout in place.

jakelandis commented 4 years ago

This test has been re-muted across branches and the timeout reduced back to the original 30s.

martijnvg commented 4 years ago

This test hasn't failed since it was enabled on January 6th in the master branch. I'm going to enable it on the 7.x branch too, and if it doesn't fail for a while I will close this issue.

martijnvg commented 4 years ago

This test failed twice yesterday on the master and 7.x branches. I will mute the test now and investigate these failures with the additional logging that was added recently.

martijnvg commented 4 years ago

The failures are related to waiting for an at-least-yellow cluster state. While the cluster state overall looks good enough to report green, this particular test waits for specific indices to be at least yellow, and one index (bwc_watch_index) just doesn't exist. This index should be created by a watcher index action, but apparently that action doesn't get executed, which results in bwc_watch_index not being created and the health API returning a red status.
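
For context, here is a minimal sketch of the health check that is timing out, using the low-level REST client. The endpoint and parameters are copied from the failing request in the stack traces above; the helper itself mirrors the test's waitForYellow but is illustrative, not the exact test code:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class WaitForYellowSketch {

    // Blocks until the three watcher-related indices are at least yellow.
    // If bwc_watch_index was never created, the health API waits out the
    // 30s timeout and answers with HTTP 408 and status "red", which the
    // low-level client surfaces as the ResponseException seen above.
    static void waitForYellow(RestClient client) throws Exception {
        Request request = new Request("GET",
            "/_cluster/health/.watches,bwc_watch_index,.watcher-history*");
        request.addParameter("wait_for_status", "yellow");
        request.addParameter("wait_for_no_relocating_shards", "true");
        request.addParameter("wait_for_no_initializing_shards", "true");
        request.addParameter("timeout", "30s");
        Response response = client.performRequest(request); // throws on 408
        assert response.getStatusLine().getStatusCode() == 200;
    }
}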

martijnvg commented 4 years ago

Unmuted the test; when it fails again, it now captures watcher stats as well.

The watcher debug logs don't indicate that the watch executes, which would explain the index action not running. I think the watch may be stuck; hopefully the watcher stats will capture this.

martijnvg commented 4 years ago

This test hasn't failed for almost a month, so I will close this issue. If it fails again, a new issue should be opened.

cbuescher commented 4 years ago

@martijnvg I just saw that this test is still muted on 7.x. Should your unmute commit 9d7b80f be backported, or was it left muted on purpose?

martijnvg commented 4 years ago

@cbuescher I think I forgot the backport... I will backport the commit you mentioned today.

mayya-sharipova commented 4 years ago

Reopening the issue, as we had another failure on master today:

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/176/console
Build scan: https://gradle-enterprise.elastic.co/s/humaqzeduckx4

REPRODUCE WITH:

./gradlew ':x-pack:qa:full-cluster-restart:v8.0.0#upgradedClusterTest' \
  --tests "org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher" \
  -Dtests.seed=7EF337E5D8E436CC \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-KW \
  -Dtests.timezone=US/Central \
  -Druntime.java=11

java.lang.AssertionError:
Expected: <2>
     but: was <3>
    at __randomizedtesting.SeedInfo.seed([7EF337E5D8E436CC:131EC4282885A13A]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.junit.Assert.assertThat(Assert.java:923)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:418)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:207)

The failure doesn't reproduce locally.

droberts195 commented 3 years ago

The logging that was added to help debug this issue is still there today - this line: https://github.com/elastic/elasticsearch/blob/35460a5f8a14bf19540224159be90696497a994d/x-pack/qa/full-cluster-restart/build.gradle#L60

That line was added over a year ago in 106c3ce686b706535686733e1354e51eec619cab. It means that the server-side logs for the X-Pack full cluster restart tests are 99% watcher debug, making it hard to debug anything else. Is this extra debug-level logging still required today?

martijnvg commented 3 years ago

@droberts195 I will remove that line. If needed, we can re-enable watcher debug logging later when we get back to investigating this test failure (and not forget to remove it afterwards).

martijnvg commented 2 years ago

A different instance of this test failing: https://gradle-enterprise.elastic.co/s/alpg4ojedmfhg

org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED   
    org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:44101/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]  
    {"cluster_name":"v7.0.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0} 
        at __randomizedtesting.SeedInfo.seed([7E0AB41A5FE5E699:13E747D7AF84716F]:0) 
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)    
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313) 
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) 
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:757)   
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:159)

arteam commented 2 years ago

Started happening again (failed 5 times over the last week): https://gradle-enterprise.elastic.co/s/4nqy4ic7pc2cc/tests/:x-pack:qa:full-cluster-restart:v7.3.1%23oldClusterTest/org.elasticsearch.xpack.restart.FullClusterRestartIT/testWatcher?top-execution=1

org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:38531], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]

benwtrent commented 1 year ago

Another failure, again due to timeout:

https://gradle-enterprise.elastic.co/s/o6w6vll7qjego

org.elasticsearch.xpack.restart.FullClusterRestartIT > testWatcher FAILED   
    org.elasticsearch.client.ResponseException: method [GET], host [http://127.0.0.1:37476/], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]  
    {"cluster_name":"v7.3.1","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0} 
        at __randomizedtesting.SeedInfo.seed([4A1898BDB88606D4:27F56B7048E79122]:0) 
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)    
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:313) 
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) 
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.waitForYellow(FullClusterRestartIT.java:752)   
        at app//org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:158)

andreidan commented 1 year ago

Another failure https://gradle-enterprise.elastic.co/s/htn2dtr4ajl5c

This one seems to be a genuine test failure:

java.lang.AssertionError:
Expected: <2>
     but: was <3>
    at __randomizedtesting.SeedInfo.seed([8299F6568D3BC8E1:EF74059B7D5A5F17]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
    at org.junit.Assert.assertThat(Assert.java:956)
    at org.junit.Assert.assertThat(Assert.java:923)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.assertBasicWatchInteractions(FullClusterRestartIT.java:722)
    at org.elasticsearch.xpack.restart.FullClusterRestartIT.testWatcher(FullClusterRestartIT.java:200)

thecoop commented 1 year ago

And another, a timeout this time: https://gradle-enterprise.elastic.co/s/snmnlk6itki4w

org.elasticsearch.client.ResponseException: method [GET], host [http://[::1]:33149], URI [/_cluster/health/.watches,bwc_watch_index,.watcher-history*?wait_for_no_relocating_shards=true&wait_for_no_initializing_shards=true&timeout=30s&wait_for_status=yellow], status line [HTTP/1.1 408 Request Timeout]
{"cluster_name":"elasticsearch","status":"red","timed_out":true,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":2,"active_shards":4,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

cbuescher commented 1 year ago

Looks like this fits here: https://gradle-enterprise.elastic.co/s/tb7rzfmtfukac

elasticsearchmachine commented 11 months ago

Pinging @elastic/es-data-management (Team:Data Management)

cbuescher commented 10 months ago

More like this: https://gradle-enterprise.elastic.co/s/dw26o7qt6or7c/tests/task/:x-pack:qa:full-cluster-restart:v7.1.1%23bwcTest/details/org.elasticsearch.xpack.restart.FullClusterRestartIT?top-execution=1

masseyke commented 6 months ago

I think that the version is getting incremented because watcher sometimes happens to run, and it updates the document in .watches with status fields. By default it does this every 500 ms (xpack.watcher.trigger.schedule.ticker.tick_interval), so it seems believable that every once in a while this happens while the test is running. I'm going to change the test to assert that the version is at least 2.
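
A minimal sketch of that relaxed assertion (the assertWatchVersion helper and its version argument are illustrative; the real test reads the version from the watch document it fetches):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;

public class WatchVersionAssertionSketch {

    // Watcher's own status writes (driven by the ~500 ms tick interval)
    // can bump the .watches document version between the test's operations,
    // so assert a lower bound instead of an exact value.
    static void assertWatchVersion(int version) {
        assertThat("watch should have been updated at least once after creation",
            version, greaterThanOrEqualTo(2));
    }
}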

masseyke commented 6 months ago

The timeout failure is more interesting, though. From the latest failure, it appears to be happening because bwc_watch_index does not exist, so the health check fails (the other two indices exist, according to the cluster state that's dumped out on failure). I have no idea why bwc_watch_index would not exist, though. I can see that the bwc_watch was created. And according to the watcher stats that are dumped out on failure, watcher is up and running and has run 29 watches (which probably maps to running this watch once per second for the 29 seconds it could before the 30-second timeout). There is no mention of bwc_watch failing, and no mention of bwc_watch_index being created or failing to be created. I haven't been able to reproduce this locally.
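
For reference, this is roughly how an index action brings bwc_watch_index into existence. The watch body below is an illustrative sketch of the 7.x+ watcher API, not the exact watch the test registers:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class BwcWatchSketch {

    // Registers a watch that fires every second and indexes its payload
    // into bwc_watch_index; the index is created the first time the action
    // executes. If the action never runs, the index never appears and the
    // health check above stays red until it times out.
    static void putBwcWatch(RestClient client) throws Exception {
        Request request = new Request("PUT", "/_watcher/watch/bwc_watch");
        request.setJsonEntity(
            "{ \"trigger\": { \"schedule\": { \"interval\": \"1s\" } },"
          + "  \"input\": { \"simple\": { \"payload\": \"static\" } },"
          + "  \"actions\": { \"index_payload\": {"
          + "      \"index\": { \"index\": \"bwc_watch_index\" } } } }");
        client.performRequest(request);
    }
}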

masseyke commented 5 months ago

I realized I hadn't been paying attention when I said that the watch had run 29 times -- that 29 was actually the threadpool max size. In the failing run there were only 2 watches in the watcher stats, while the test creates 3. I dumped out the watcher stats in a successful run, and one node showed 2 watches and the other 1 watch (as you'd expect). So it appears that in this timeout run the bwc_watch somehow silently failed to be created.