Closed slobodanadamovic closed 2 years ago
Pinging @elastic/es-distributed (Team:Distributed)
This does not reproduce locally; it seems to happen mostly on ci-immutable-windows workers, which are slower.
I was able to force this situation by adding a sleep before https://github.com/elastic/elasticsearch/blob/352a688b041746a669879022b6b1934f8a011892/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationDiagnosticsService.java#L664, the line that populates the response map later used to build the error message. I suspect the failure happens when the test runs on slow hardware and a nonActiveMasterNode does not receive a response in time.
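To illustrate the kind of race suspected here, the following is a minimal, self-contained sketch. It is not the CoordinationDiagnosticsService code; the class name, `startSlowResponder`, `waitUntil`, and the `"node-1"` key are all hypothetical. A response map is populated after an artificial delay (standing in for the sleep mentioned above), and the reader either checks the map right away or polls until an entry appears.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class SlowResponseRaceSketch {

    /** Polls the condition until it holds or the timeout elapses. */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        // One last check so a response arriving right at the deadline still counts.
        return condition.getAsBoolean();
    }

    /** Hypothetical stand-in for a remote response that lands in the map after delayMillis. */
    static Map<String, String> startSlowResponder(ScheduledExecutorService scheduler, long delayMillis) {
        Map<String, String> responses = new ConcurrentHashMap<>();
        scheduler.schedule(() -> { responses.put("node-1", "ok"); }, delayMillis, TimeUnit.MILLISECONDS);
        return responses;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            Map<String, String> responses = startSlowResponder(scheduler, 200);
            // Reading immediately is what a test on slow hardware may effectively do.
            boolean seenImmediately = !responses.isEmpty();
            // Polling tolerates the delay instead of depending on timing.
            boolean seenAfterWait = waitUntil(() -> !responses.isEmpty(), 5_000);
            System.out.println("immediately=" + seenImmediately + " afterWait=" + seenAfterWait);
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

A test that asserts on the map's contents without waiting depends on the response arriving first; waiting for the condition removes that dependence on hardware speed.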
Related: I plan to do the same thing for testNoQuorum that I did for CoordinationDiagnosticsServiceIT::testBlockClusterStateProcessingOnOneNode in #89001. Either one (#89064 or the one similar to #89001 that does not exist yet) would fix the problem, but I think it would be good to have both.
Build scan: https://gradle-enterprise.elastic.co/s/swesmanp5gnfo/tests/:server:internalClusterTest/org.elasticsearch.discovery.StableMasterDisruptionIT/testNoQuorum
Reproduction line:
gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.StableMasterDisruptionIT.testNoQuorum" -Dtests.seed=935714954462D421 -Dtests.locale=vi -Dtests.timezone=Pacific/Pitcairn -Druntime.java=18
Applicable branches: main
Reproduces locally?: Didn't try
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.StableMasterDisruptionIT&tests.test=testNoQuorum
Failure excerpt: