elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] StableMasterDisruptionIT testNoQuorum failing #88931

Closed slobodanadamovic closed 2 years ago

slobodanadamovic commented 2 years ago

Build scan: https://gradle-enterprise.elastic.co/s/swesmanp5gnfo/tests/:server:internalClusterTest/org.elasticsearch.discovery.StableMasterDisruptionIT/testNoQuorum

Reproduction line: gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.StableMasterDisruptionIT.testNoQuorum" -Dtests.seed=935714954462D421 -Dtests.locale=vi -Dtests.timezone=Pacific/Pitcairn -Druntime.java=18

Applicable branches: main

Reproduces locally?: Didn't try

Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.StableMasterDisruptionIT&tests.test=testNoQuorum

Failure excerpt:

java.lang.AssertionError: {"status":"red","cluster_name":"TEST-TEST_WORKER_VM=[76]-CLUSTER_SEED=[3211026357706090361]-HASH=[245916A02B0]-cluster","indicators":{"master_is_stable":{"status":"red","symptom":"No master node observed in the last 1s, and the cause has not been determined.","details":{"current_master":{"node_id":null,"name":null},"recent_masters":[{"node_id":"EjF8KgPrRbWT5Z2e4iokag","name":"node_t0"}],"cluster_formation":{}},"impacts":[{"severity":1,"description":"The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.","impact_areas":["ingest"]},{"severity":1,"description":"Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.","impact_areas":["deployment_management"]},{"severity":3,"description":"Snapshot and restore will not work. Searchable snapshots cannot be mounted.","impact_areas":["backup"]}],"diagnosis":[{"cause":"The Elasticsearch cluster does not have a stable master node.","action":"Get help at https://ela.st/getting-help","help_url":"https://ela.st/getting-help"}]},"repository_integrity":{"status":"unknown","symptom":"Could not determine health status. Check details on critical issues preventing the health status from reporting.","details":{"reasons":{"master_is_stable":"red"}}},"shards_availability":{"status":"unknown","symptom":"Could not determine health status. Check details on critical issues preventing the health status from reporting.","details":{"reasons":{"master_is_stable":"red"}}}}}
Expected: a string containing "unable to form a quorum"
     but: was "No master node observed in the last 1s, and the cause has not been determined."

  at __randomizedtesting.SeedInfo.seed([935714954462D421:D4288B826E8661E0]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.lambda$assertMasterStability$0(StableMasterDisruptionIT.java:140)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1104)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1077)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.assertMasterStability(StableMasterDisruptionIT.java:136)
  at org.elasticsearch.discovery.StableMasterDisruptionIT.testNoQuorum(StableMasterDisruptionIT.java:613)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
  at java.lang.reflect.Method.invoke(Method.java:577)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:833)

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

hendrikmuhs commented 2 years ago

https://gradle-enterprise.elastic.co/s/edbf375u6dgdm

idegtiarenko commented 2 years ago

This does not reproduce locally; however, it seems to happen mostly on ci-immutable-windows workers, which are slower.

I was able to force this situation by adding a sleep before https://github.com/elastic/elasticsearch/blob/352a688b041746a669879022b6b1934f8a011892/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java#L664, the line that populates the response map that is later used to build the error message. I suspect the failure happens when the test runs on slow hardware and a nonActiveMasterNode does not receive a response in time.
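To illustrate the suspected ordering problem, here is a minimal standalone sketch. It is not the actual CoordinationDiagnosticsService code: the class `SymptomRaceSketch`, the methods `buildSymptom` and `symptom`, the node name `node_t1`, and the exact message format are all hypothetical. It only assumes that the symptom text is built from a map of responses that another thread fills in, and that an empty map yields the "cause has not been determined" message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the suspected race; not Elasticsearch code.
class SymptomRaceSketch {

    // Builds a symptom string from whatever responses have arrived so far.
    static String buildSymptom(Map<String, String> responses) {
        if (responses.isEmpty()) {
            return "No master node observed in the last 1s, and the cause has not been determined.";
        }
        return "No master node observed in the last 1s: " + String.join("; ", responses.values());
    }

    // Latches make the ordering deterministic: the response is stored either
    // before or after the symptom is built, depending on the flag.
    static String symptom(boolean responseArrivesFirst) {
        Map<String, String> responses = new ConcurrentHashMap<>();
        CountDownLatch mayRespond = new CountDownLatch(1);
        CountDownLatch responded = new CountDownLatch(1);

        Thread responder = new Thread(() -> {
            try {
                mayRespond.await(); // stands in for the sleep added on a fast machine
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            responses.put("node_t1", "unable to form a quorum");
            responded.countDown();
        });
        responder.start();

        try {
            String result;
            if (responseArrivesFirst) {
                mayRespond.countDown();
                responded.await();                // response is in the map before we read it
                result = buildSymptom(responses);
            } else {
                result = buildSymptom(responses); // map is still empty: the slow-hardware case
                mayRespond.countDown();
            }
            responder.join();
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(symptom(true));
        System.out.println(symptom(false));
    }
}
```

In this sketch, the `symptom(false)` path corresponds to the failing CI runs: the symptom is built before any response is recorded, so the test's expected substring is missing.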

masseyke commented 2 years ago

Related: I plan to do the same thing for testNoQuorum that I did for CoordinationDiagnosticsServiceIT::testBlockClusterStateProcessingOnOneNode in #89001. Either one (#89064, or the one similar to #89001 that does not exist yet) would fix the problem, but I think it would be good to have both.