pebrc opened this issue 2 years ago
I have reached out to the Elasticsearch team to figure out if we can improve something in the node shutdown API. What I still need to understand better is why our own check https://github.com/elastic/cloud-on-k8s/blob/24ab02e487b04f298825d48b146a7a50f47823ac/pkg/controller/elasticsearch/driver/upgrade_predicates.go#L648 did not protect us in this scenario. All the preconditions apply: the cluster state is yellow and we have a version upgrade.
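As a reminder of what that predicate has to reason about: on a yellow cluster, the question is which indices still have unassigned replica copies. A minimal sketch of how to list them via the _cat/shards API (my own illustration against a plain unauthenticated endpoint, not the operator's code):

```go
// Sketch only (my illustration, not the operator's predicate code): list the
// indices that still have unassigned replica copies via the _cat/shards API.
// Assumes an Elasticsearch endpoint reachable without auth/TLS at esURL.
package esdiag

import (
	"encoding/json"
	"net/http"
)

type catShard struct {
	Index  string `json:"index"`
	Shard  string `json:"shard"`
	PriRep string `json:"prirep"` // "p" primary, "r" replica
	State  string `json:"state"`  // STARTED, INITIALIZING, RELOCATING, UNASSIGNED
	Node   string `json:"node"`
}

// unassignedReplicas returns, per index, how many replica copies are currently
// unassigned. On a yellow cluster this is the set of indices the predicate has
// to reason about before allowing a node restart.
func unassignedReplicas(esURL string) (map[string]int, error) {
	resp, err := http.Get(esURL + "/_cat/shards?format=json")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var shards []catShard
	if err := json.NewDecoder(resp.Body).Decode(&shards); err != nil {
		return nil, err
	}
	counts := map[string]int{}
	for _, s := range shards {
		if s.PriRep == "r" && s.State == "UNASSIGNED" {
			counts[s.Index]++
		}
	}
	return counts, nil
}
```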
So regarding the check for shard activity, it looks like the cluster health API request timed out:
Jun 17, 2022 @ 07:18:09.543 task [cluster_health (wait_for_events [LANGUID])] timed out after [0s]
{"log.level":"debug","@timestamp":"2022-06-17T07:18:09.539Z","log.logger":"elasticsearch-client","message":"Elasticsearch HTTP request","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.versi
on":"1.4.0","method":"GET","url":"https://test-version-upgrade-to-8x-r6p5-es-internal-http.e2e-fusqo-mercury.svc:9200/_cluster/health?wait_for_events=languid&timeout=0s","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5"}
# the parse error here is misleading; it just means we got a health response with a non-200 HTTP code
{"log.level":"error","@timestamp":"2022-06-17T07:18:09.545Z","log.logger":"elasticsearch-client","message":"Cannot parse Elasticsearch error response body","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","error":"--useless stack trace elided---"}
But the operator seems to have continued. I would have expected the if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas predicate to save us here. I wonder if the cluster was not (yet) reporting yellow health.
In build/4899, there were exactly 10 x "Elasticsearch cluster health check failure at ...: cluster health red", and in #7000 we set the limit to 10, so it should have worked. But I didn't see that there may also be 1 x "error while creating the Elasticsearch client", which is also counted in continuousHealthChecks.FailureCount. Let's set the limit to 15 to get a little margin while staying relatively low.
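For clarity, a minimal sketch of why both kinds of errors end up in the same counter (the type and method names here are simplified stand-ins for the e2e harness, not the real code):

```go
// Simplified sketch (names are stand-ins, not the real e2e types): a continuous
// health checker whose FailureCount mixes "cluster health red" observations
// with errors hit before the health call, e.g. while creating the Elasticsearch
// client. Synchronization is omitted for brevity.
package e2e

import (
	"errors"
	"time"
)

type ContinuousHealthCheck struct {
	FailureCount int
	Failures     []error
	stop         chan struct{}
}

// Start polls the given health function every few seconds until Stop is called.
func (c *ContinuousHealthCheck) Start(checkHealth func() (string, error)) {
	c.stop = make(chan struct{})
	ticker := time.NewTicker(3 * time.Second)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-c.stop:
				return
			case <-ticker.C:
				status, err := checkHealth()
				switch {
				case err != nil:
					// e.g. "error while creating the Elasticsearch client: ..."
					c.addFailure(err)
				case status == "red":
					c.addFailure(errors.New("cluster health red"))
				}
			}
		}
	}()
}

func (c *ContinuousHealthCheck) Stop() { close(c.stop) }

func (c *ContinuousHealthCheck) addFailure(err error) {
	c.Failures = append(c.Failures, err)
	c.FailureCount++
}
```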
Despite #7012, we recently had another failure.
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/324
=== RUN TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:43.223407119 +0000 UTC m=+2275.455997811: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:46.209786122 +0000 UTC m=+2278.442376802: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:49.222156744 +0000 UTC m=+2281.454747428: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:52.208771689 +0000 UTC m=+2284.441362371: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:55.21755595 +0000 UTC m=+2287.450146644: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:58.22815551 +0000 UTC m=+2290.460746201: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:01.224268657 +0000 UTC m=+2293.456859353: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:04.215287808 +0000 UTC m=+2296.447878500: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:07.217946796 +0000 UTC m=+2299.450537482: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:10.214928657 +0000 UTC m=+2302.447519343: cluster health red
steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:13.228691444 +0000 UTC m=+2305.461282139: cluster health red
{
"log.level": "error",
"@timestamp": "2023-10-06T23:37:43.291Z",
"message": "continuing with additional tests",
"service.version": "0.0.0-SNAPSHOT+00000000",
"service.type": "eck",
"ecs.version": "1.4.0",
"error": "test failure",
"error.stack_trace": "github.com/elastic/cloud-on-k8s/v2/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:52\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:60\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1595"
}
{
"log.level": "info",
"@timestamp": "2023-10-06T23:37:43.296Z",
"log.logger": "e2e",
"message": "Running eck-diagnostics",
"service.version": "0.0.0-SNAPSHOT+00000000",
"service.type": "eck",
"ecs.version": "1.4.0",
"cluster": "eck-e2e-stack-8-7-1-mwnv-324",
"test": "TestAgentVersionUpgradeToLatest8x",
"step": "Elasticsearch cluster health should not have been red during mutation process"
}
I1006 23:39:56.906242 17601 request.go:697] Waited for 1.046242504s due to client-side throttling, not priority and fairness, request: GET:https://10.95.176.1:443/apis/enterprisesearch.k8s.elastic.co/v1
error while creating the Elasticsearch client: secrets "test-agent-upgrade-xxhl-es-elastic-user" not found
{
"log.level": "error",
"@timestamp": "2023-10-06T23:39:59.162Z",
"log.logger": "e2e",
"message": "while deleting elastic resources",
"service.version": "0.0.0-SNAPSHOT+00000000",
"service.type": "eck",
"ecs.version": "1.4.0",
"error": "while deleting secret test-agent-upgrade-xxhl-es-masterdata-es-config in namespace e2e-u3xme-mercury: secrets \"test-agent-upgrade-xxhl-es-masterdata-es-config\" not found",
"error.stack_trace": "github.com/elastic/cloud-on-k8s/v2/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:61\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:60\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1595"
}
Code of the step where we stop the continuousHealthChecks, log all the failures, and check the number of failures with assert.LessOrEqual:
https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/steps_mutation.go#L151-L165
The other piece of code to keep in mind is what we do to continue the other tests when a test fails (the "continuing with additional tests" handling seen in the log above):
What's strange is that according to the log we got 11 x "Elasticsearch cluster health check failure at xxxxx: cluster health red". There is also 1 x "error while creating the Elasticsearch client", but I'm not sure it is counted, because it is not logged with "Elasticsearch cluster health check failure".
We tolerate 15 failures:
https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/builder.go#L30 https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/steps_mutation.go#L163
I noticed that here we use assert instead of require like in all the other tests, but I don't see how that explains what's happening.
We fail the test as soon as there is a failure, using t.Errorf, despite our tolerance mechanism. 🤦
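To make the effect concrete, a hypothetical sketch (not the actual test code, nor the diff in the PR below):

```go
// Hypothetical illustration: t.Errorf marks the test as failed while letting it
// continue, so reporting every individual failure this way fails the step even
// when the total stays within the tolerance.
package e2e

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

func reportFailures(t *testing.T, failures []error, tolerance int) {
	for _, f := range failures {
		// Each call below already fails the test, tolerance or not.
		t.Errorf("Elasticsearch cluster health check failure: %v", f)
	}
	// The tolerance check can no longer rescue the test at this point.
	assert.LessOrEqual(t, len(failures), tolerance,
		"ContinuousHealthChecks failures count (%d) is above the tolerance (%d)",
		len(failures), tolerance)
}
```

Logging the individual failures (t.Logf) and letting the single assert.LessOrEqual decide would keep the tolerance meaningful.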
https://github.com/elastic/cloud-on-k8s/pull/7358 should fix this.
But it does not fix the flaky test itself:
https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/381
steps_mutation.go:159: ContinuousHealthChecks failures count (52) is above the tolerance (15): 52 x [cluster health red]
52 seems very high.
Related to https://github.com/elastic/cloud-on-k8s/issues/5040, but now with the node shutdown API in use:
https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/754//testReport
Rolling upgrade from 8.2.0 to 8.2.2
Timeline:
The way I read this sequence of events is that we started a rolling upgrade only 3 seconds after the ILM history index had been created, and apparently its replica was not ready yet when we took down the first node to upgrade. However, my expectation would be that the node shutdown API should not allow a node to go down if the only remaining replica is not yet initialised.
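For reference, this is the interaction I have in mind, sketched against the node shutdown API as I understand it (plain HTTP, no auth/TLS, node ID and reason are placeholders; not the operator's code):

```go
// Sketch only: register a restart-type shutdown for a node and poll its
// status. The expectation discussed above is that the status should not
// report COMPLETE while the only remaining replica of an index is still
// initialising.
package shutdown

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// requestRestartShutdown registers a restart-type shutdown for the given node.
func requestRestartShutdown(esURL, nodeID string) error {
	body, _ := json.Marshal(map[string]string{
		"type":   "restart",
		"reason": "rolling upgrade", // placeholder reason
	})
	req, err := http.NewRequest(http.MethodPut, esURL+"/_nodes/"+nodeID+"/shutdown", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("shutdown registration failed: %s", resp.Status)
	}
	return nil
}

// shutdownComplete reports whether the registered shutdown for nodeID has
// reached the COMPLETE status, i.e. the node is supposedly safe to stop.
func shutdownComplete(esURL, nodeID string) (bool, error) {
	resp, err := http.Get(esURL + "/_nodes/" + nodeID + "/shutdown")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var out struct {
		Nodes []struct {
			NodeID string `json:"node_id"`
			Status string `json:"status"` // NOT_STARTED, IN_PROGRESS, STALLED, COMPLETE
		} `json:"nodes"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, err
	}
	for _, n := range out.Nodes {
		if n.NodeID == nodeID {
			return n.Status == "COMPLETE", nil
		}
	}
	return false, fmt.Errorf("no shutdown record found for node %s", nodeID)
}
```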