elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] StableMasterDisruptionIT testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable failing #89507

Closed kingherc closed 2 years ago

kingherc commented 2 years ago

Although the original build scan is from another branch, I can sometimes reproduce the failure on main (commit 7d7332afad6bb47d9e23fb05adc3407ada7fada1).

Build scan: https://gradle-enterprise.elastic.co/s/pr5ajotlfldxe/tests/:server:internalClusterTest/org.elasticsearch.discovery.StableMasterDisruptionIT/testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable

Reproduction line: ./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.StableMasterDisruptionIT.testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable" -Dtests.seed=16A056C44AC79D5D -Dtests.locale=sr-ME -Dtests.timezone=Asia/Urumqi -Druntime.java=17

Applicable branches: main

Reproduces locally?: Yes

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.StableMasterDisruptionIT&tests.test=testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable

Failure excerpt:

java.lang.AssertionError: {"status":"yellow","cluster_name":"TEST-TEST_WORKER_VM=[388]-CLUSTER_SEED=[-6016389917722388361]-HASH=[DB766673DF]-cluster","indicators":{"master_is_stable":{"status":"yellow","symptom":"The cluster's master has alternated between [{node_t0}{NeQPMpU0TLWXGVk6QjUyzg}{29qalSL1TUu1azWd84HDSA}{node_t0}{127.0.0.1}{127.0.0.1:24971}{m}] and no master multiple times in the last 30m","details":{"current_master":{"node_id":"NeQPMpU0TLWXGVk6QjUyzg","name":"node_t0"},"recent_masters":[{"node_id":"NeQPMpU0TLWXGVk6QjUyzg","name":"node_t0"},{"node_id":"NeQPMpU0TLWXGVk6QjUyzg","name":"node_t0"},{"node_id":"NeQPMpU0TLWXGVk6QjUyzg","name":"node_t0"}],"exception_fetching_history":{"message":"[node_t0][127.0.0.1:24971][internal:cluster/master_history/get] request_id [23] timed out after [10112ms]","stack_trace":"org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_t0][127.0.0.1:24971][internal:cluster/master_history/get] request_id [23] timed out after [10112ms]\n"}},"impacts":[{"severity":1,"description":"The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.","impact_areas":["ingest"]},{"severity":1,"description":"Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.","impact_areas":["deployment_management"]},{"severity":3,"description":"Snapshot and restore will not work. Searchable snapshots cannot be mounted.","impact_areas":["backup"]}],"diagnosis":[{"cause":"The Elasticsearch cluster does not have a stable master node.","action":"Get help at https://ela.st/getting-help","help_url":"https://ela.st/getting-help"}]},"repository_integrity":{"status":"unknown","symptom":"Could not determine health status. Check details on critical issues preventing the health status from reporting.","details":{"reasons":{"master_is_stable":"yellow"}}},"shards_availability":{"status":"unknown","symptom":"Could not determine health status. Check details on critical issues preventing the health status from reporting.","details":{"reasons":{"master_is_stable":"yellow"}}}}}
Expected: <GREEN>
     but: was <YELLOW>

  at __randomizedtesting.SeedInfo.seed([16A056C44AC79D5D:FD8BA8657148FC94]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.lambda$assertMasterStability$0(StableMasterDisruptionIT.java:140)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1104)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1077)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.assertMasterStability(StableMasterDisruptionIT.java:137)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.assertGreenMasterStability(StableMasterDisruptionIT.java:133)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable(StableMasterDisruptionIT.java:474)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:568)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)
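
For readers following the trace: the failing check runs inside `ESTestCase.assertBusy`, which retries an assertion until it passes or a timeout elapses, so the `YELLOW` status was still being reported when the wait ran out. Below is a minimal sketch of that polling pattern. It is not the Elasticsearch implementation: the class `BusyAssertSketch`, the `Check` interface, the backoff values, and the simulated status sequence are all assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BusyAssertSketch {

    /** Hypothetical check type: signals failure by throwing AssertionError. */
    @FunctionalInterface
    public interface Check {
        void run() throws Exception;
    }

    /**
     * Re-runs the check with a capped exponential backoff until it passes.
     * Once the deadline has passed, the last AssertionError is rethrown.
     */
    public static void assertBusy(Check check, long maxWait, TimeUnit unit) throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the most recent failure
                }
            }
            Thread.sleep(sleepMillis);
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated health status: YELLOW on the first two polls, then GREEN.
        AtomicInteger polls = new AtomicInteger();
        assertBusy(() -> {
            String status = polls.incrementAndGet() < 3 ? "YELLOW" : "GREEN";
            if (!"GREEN".equals(status)) {
                throw new AssertionError("Expected: <GREEN> but: was <" + status + ">");
            }
        }, 5, TimeUnit.SECONDS);
        System.out.println("passed after " + polls.get() + " polls");
    }
}
```

In this sketch a status that never turns green within the wait produces the same kind of `Expected: <GREEN> but: was <YELLOW>` error shown in the excerpt above.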

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)

masseyke commented 2 years ago

Closing as a duplicate of #89431 (which has a fix going in soon).