masseyke closed this issue 2 years ago
Pinging @elastic/es-data-management (Team:Data Management)
I haven't been able to reproduce this one yet, but here's my guess as to what's happening: this test intentionally and repeatedly simulates long GC pauses on the master so that the other nodes think the master has gone null (no master). Then we stop pausing the master and assert that the data nodes think the master has gone null too often, while the master itself thinks it has been fine. In this case it's returning this response:
{
  "status": "yellow",
  "cluster_name": "TEST-TEST_WORKER_VM=[226]-CLUSTER_SEED=[7924964341647394941]-HASH=[F908511566]-cluster",
  "indicators": {
    "master_is_stable": {
      "status": "yellow",
      "symptom": "The cluster's master has alternated between [{node_t0}{cX_Eos3KTJi7ejP3Ciy7Gw}{ZPwl9rtsTlmdxJanBAD2ew}{node_t0}{127.0.0.1}{127.0.0.1:20111}{m}] and no master multiple times in the last 30m",
      "details": {
        "current_master": {
          "node_id": "cX_Eos3KTJi7ejP3Ciy7Gw",
          "name": "node_t0"
        },
        "recent_masters": [
          {
            "node_id": "cX_Eos3KTJi7ejP3Ciy7Gw",
            "name": "node_t0"
          },
          {
            "node_id": "cX_Eos3KTJi7ejP3Ciy7Gw",
            "name": "node_t0"
          },
          {
            "node_id": "cX_Eos3KTJi7ejP3Ciy7Gw",
            "name": "node_t0"
          }
        ],
        "exception_fetching_history": {
          "message": "[node_t0][127.0.0.1:20111][internal:cluster/master_history/get] request_id [25] timed out after [10044ms]",
          "stack_trace": "org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_t0][127.0.0.1:20111][internal:cluster/master_history/get] request_id [25] timed out after [10044ms]\n"
        }
      },
      "impacts": [
        {
          "severity": 1,
          "description": "The cluster cannot create, delete, or rebalance indices, and cannot insert or update documents.",
          "impact_areas": [
            "ingest"
          ]
        },
        {
          "severity": 1,
          "description": "Scheduled tasks such as Watcher, ILM, and SLM will not work. The _cat APIs will not work.",
          "impact_areas": [
            "deployment_management"
          ]
        },
        {
          "severity": 3,
          "description": "Snapshot and restore will not work. Searchable snapshots cannot be mounted.",
          "impact_areas": [
            "backup"
          ]
        }
      ],
      "diagnosis": [
        {
          "cause": "The Elasticsearch cluster does not have a stable master node.",
          "action": "Get help at https://ela.st/getting-help",
          "help_url": "https://ela.st/getting-help"
        }
      ]
    },
    "repository_integrity": {
      "status": "unknown",
      "symptom": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
      "details": {
        "reasons": {
          "master_is_stable": "yellow"
        }
      }
    },
    "shards_availability": {
      "status": "unknown",
      "symptom": "Could not determine health status. Check details on critical issues preventing the health status from reporting.",
      "details": {
        "reasons": {
          "master_is_stable": "yellow"
        }
      }
    }
  }
}
There was a timeout while reaching out to the master to see whether it thinks there is a problem. My guess is that this timeout happened while the master was paused (and that the random pause lasted 10+ seconds). I noticed that in CoordinationDiagnosticsService#clusterChanged we only send the request for master history when the master node turns null, not when it turns non-null again after having been null. So when the master came back alive we never made a follow-up request. If that's right, there are two possible ways we could fix this:
(1) Change clusterChanged() to also fetch the remote master history when the master changes from null to non-null (maybe only when we have already seen a non-null master in our local history, so that we don't make this request when the cluster first comes up).
(2) Change the test so that its GC pauses are never longer than the request timeout for master history, and live with the fact that we'll sometimes get false positives in practice.
Of those, (1) probably sounds more appealing.
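For illustration only, here is a minimal, self-contained sketch of the decision behind option (1). The class and method names are made up and do not reflect the real CoordinationDiagnosticsService code; it just shows the idea of also fetching the remote history on a null-to-non-null transition, gated on having already seen a non-null master locally.

// Hypothetical, simplified sketch of fix option (1); names and structure are
// invented for illustration and do not match CoordinationDiagnosticsService.
import java.util.List;

public class MasterHistoryFetchDecision {

    /**
     * Decide whether to ask the elected master for its view of the master history.
     *
     * @param previousMaster the master node id in the previous cluster state, or null
     * @param currentMaster  the master node id in the new cluster state, or null
     * @param localHistory   the master node ids this node has observed locally (nulls included)
     */
    static boolean shouldFetchRemoteMasterHistory(String previousMaster,
                                                  String currentMaster,
                                                  List<String> localHistory) {
        // Existing behavior: fetch when the master has just gone null.
        if (previousMaster != null && currentMaster == null) {
            return true;
        }
        // Proposed addition: also fetch when the master comes back after being null,
        // but only if we have already seen a non-null master locally, so that this
        // request is not fired while the cluster is first forming.
        if (previousMaster == null && currentMaster != null) {
            return localHistory.stream().anyMatch(m -> m != null);
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> history = java.util.Arrays.asList("node_t0", null, "node_t0");
        // Master just came back after a simulated GC pause: re-check its history.
        System.out.println(shouldFetchRemoteMasterHistory(null, "node_t0", history)); // true
        // Cluster is still forming (no master seen yet): no remote fetch.
        System.out.println(shouldFetchRemoteMasterHistory(null, "node_t0",
                java.util.Collections.singletonList(null))); // false
    }
}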
It looks like there are actually 3 bugs here:
(1) We don't query for the remote master history when the master comes back non-null (described above).
(2) The CountDownLatch was waiting for 2 of the 3 nodes to ack the missing master. That's not really a bug, but the master node is never going to see itself as null, so it's pointless to even have code for that.
(3) The call to ensureStableMaster() was not specifying a node. In this case it picked the master node to run the check from, and the master node always thinks it's fine, so the check returned before the two data nodes had joined the master. The next disruption would then begin before the data nodes had joined, so they would never receive a clusterChanged event saying the master had gone null, and the CountDownLatch had to wait out its full 30 seconds (a big waste of time).
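As a rough illustration of the test-side fix for bug (2), here is a hedged sketch in which the latch only counts the data nodes, since the elected master will never observe itself as null. The node names and listener wiring are invented for this example and are not the real StableMasterDisruptionIT code.

// Hypothetical sketch: wait only for the data nodes to observe the missing master.
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class MissingMasterLatchSketch {

    interface MasterNullListener {
        void onMasterWentNull();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> dataNodes = List.of("node_d0", "node_d1");

        // One count per data node; the master node is deliberately excluded,
        // because it never sees itself as a null master.
        CountDownLatch missingMasterSeen = new CountDownLatch(dataNodes.size());

        for (String node : dataNodes) {
            MasterNullListener listener = missingMasterSeen::countDown;
            // In the real test this listener would be registered with the node's
            // cluster service; here we just simulate both nodes noticing the outage.
            listener.onMasterWentNull();
        }

        // If a node never joined the master before the disruption started, it would
        // never see the master go null and this wait would burn the full timeout.
        boolean allSaw = missingMasterSeen.await(30, TimeUnit.SECONDS);
        System.out.println("All data nodes observed a null master: " + allSaw);
    }
}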
This still fails on 8.4: https://gradle-enterprise.elastic.co/s/y645yuwowvuas
@masseyke Maybe we need to backport the fix?
Sorry, I just saw this, and have just backported it: https://github.com/elastic/elasticsearch/pull/90040
Build scan: https://gradle-enterprise.elastic.co/s/sjbgu4zjbxhmo/tests/:server:internalClusterTest/org.elasticsearch.discovery.StableMasterDisruptionIT/testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable
Reproduction line:
./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.discovery.StableMasterDisruptionIT.testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable" -Dtests.seed=5F13EFF29906BB53 -Dtests.locale=en-IE -Dtests.timezone=Australia/South -Druntime.java=17
Applicable branches: main
Reproduces locally?: No
Failure history: https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.discovery.StableMasterDisruptionIT&tests.test=testRepeatedNullMasterRecognizedAsGreenIfMasterDoesNotKnowItIsUnstable
Failure excerpt: