elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process (revisited) #5795

Open pebrc opened 2 years ago

pebrc commented 2 years ago

Related to https://github.com/elastic/cloud-on-k8s/issues/5040 but now with node shutdown API in use:

https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-gke-k8s-versions/754//testReport

Rolling upgrade from 8.2.0 to 8.2.2

Timeline:

The way I read this sequence of events is that we started a rolling upgrade only 3 seconds after the ILM history index had been created, and its replica was apparently not yet ready when we took down the first node to upgrade. However, my expectation would be that the node shutdown API should not allow a node to go down if the only remaining replica is not yet initialised.
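For illustration, this is the kind of gate I would expect around the node shutdown API: only take the pod down once the shutdown registered for that node reports COMPLETE. A minimal Go sketch, assuming the GET _nodes/<node_id>/shutdown response carries a per-node status field; esURL, nodeID and the bare http.Client are placeholders, and this is not the operator's actual implementation:

// Minimal sketch: poll the node shutdown API and only delete the pod once
// Elasticsearch reports the registered shutdown as COMPLETE.
package shutdown

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type shutdownStatus struct {
    Nodes []struct {
        NodeID string `json:"node_id"`
        Status string `json:"status"` // NOT_STARTED, IN_PROGRESS, STALLED, COMPLETE
    } `json:"nodes"`
}

// shutdownComplete reports whether the shutdown registered for nodeID has
// finished, i.e. shard migration is done and taking the node down should not
// leave the only remaining copy of a shard unassigned.
func shutdownComplete(client *http.Client, esURL, nodeID string) (bool, error) {
    resp, err := client.Get(fmt.Sprintf("%s/_nodes/%s/shutdown", esURL, nodeID))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var status shutdownStatus
    if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
        return false, err
    }
    for _, n := range status.Nodes {
        if n.NodeID == nodeID {
            return n.Status == "COMPLETE", nil
        }
    }
    // No shutdown registered for this node: do not proceed with deletion.
    return false, nil
}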

pebrc commented 2 years ago

I have reached out to the Elasticsearch team to figure out whether we can improve something in the node shutdown API. What I still need to understand is why our own check https://github.com/elastic/cloud-on-k8s/blob/24ab02e487b04f298825d48b146a7a50f47823ac/pkg/controller/elasticsearch/driver/upgrade_predicates.go#L648 did not protect us in this scenario. All preconditions apply: the cluster state is yellow and we have a version upgrade.
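For context, a hypothetical and heavily simplified sketch of the shape of such a predicate; the types and the nodesWithUnassignedReplicas map are placeholders, this is not the actual upgrade_predicates.go code:

// Simplified sketch: during a version upgrade with yellow cluster health,
// only allow restarting a candidate node if it is already associated with
// the unassigned replicas, so we never take down the last healthy copy of a
// shard that is still fully assigned elsewhere.
package predicates

type ClusterHealth string

const (
    Green  ClusterHealth = "green"
    Yellow ClusterHealth = "yellow"
    Red    ClusterHealth = "red"
)

type upgradeContext struct {
    health           ClusterHealth
    isVersionUpgrade bool
    // names of nodes that currently hold shards whose replicas are unassigned
    nodesWithUnassignedReplicas map[string]bool
}

// canRestart reports whether the candidate node may be restarted now.
func canRestart(ctx upgradeContext, candidateNode string) bool {
    if !ctx.isVersionUpgrade || ctx.health != Yellow {
        return true // this predicate only constrains yellow clusters during upgrades
    }
    return ctx.nodesWithUnassignedReplicas[candidateNode]
}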

pebrc commented 2 years ago

Regarding the check for shard activity, it looks like the cluster health API request timed out:

Jun 17, 2022 @ 07:18:09.543 task [cluster_health (wait_for_events [LANGUID])] timed out after [0s]
{"log.level":"debug","@timestamp":"2022-06-17T07:18:09.539Z","log.logger":"elasticsearch-client","message":"Elasticsearch HTTP request","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.versi
on":"1.4.0","method":"GET","url":"https://test-version-upgrade-to-8x-r6p5-es-internal-http.e2e-fusqo-mercury.svc:9200/_cluster/health?wait_for_events=languid&timeout=0s","namespace":"e2e-fusqo-mercury","es_name":"test-version-upgrade-to-8x-r6p5"}
# parse error here is misleading just means we got a health response with an non-200 HTTP code
{"log.level":"error","@timestamp":"2022-06-17T07:18:09.545Z","log.logger":"elasticsearch-client","message":"Cannot parse Elasticsearch error response body","service.version":"2.4.0-SNAPSHOT+903d77f6","service.type":"eck","ecs.version":"1.4.0","error":"--useless stack trace elided---"}

But the operator seems to have continued. I would have expected the if_yellow_only_restart_upgrading_nodes_with_unassigned_replicas predicate to save us here. I wonder if the cluster was not (yet) reporting yellow health.
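For completeness, my reading of the misleading parse error: with wait_for_events=languid and timeout=0s the request times out immediately, Elasticsearch answers with HTTP 408 but still returns a regular cluster health body (timed_out: true), and the client then tries to parse the non-200 response as an error document. A minimal Go sketch of handling such a response, assuming a plain net/http client and a placeholder esURL; this is not the ECK client code:

// Minimal sketch: both 200 and 408 carry a cluster health body here, only
// other status codes are treated as real errors.
package health

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type clusterHealth struct {
    Status             string `json:"status"` // green, yellow, red
    TimedOut           bool   `json:"timed_out"`
    RelocatingShards   int    `json:"relocating_shards"`
    InitializingShards int    `json:"initializing_shards"`
    UnassignedShards   int    `json:"unassigned_shards"`
}

func getHealth(client *http.Client, esURL string) (clusterHealth, error) {
    var h clusterHealth
    resp, err := client.Get(esURL + "/_cluster/health?wait_for_events=languid&timeout=0s")
    if err != nil {
        return h, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusRequestTimeout {
        return h, fmt.Errorf("unexpected status code %d", resp.StatusCode)
    }
    err = json.NewDecoder(resp.Body).Decode(&h)
    return h, err
}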

thbkrkr commented 1 year ago

In build/4899, there were exactly 10 x "Elasticsearch cluster health check failure at ...: cluster health red", and in #7000 we set the limit to 10, so it should have worked. But I had missed that there may also be 1 x "error while creating the Elasticsearch client", which is also counted in continuousHealthChecks.FailureCount.

https://github.com/elastic/cloud-on-k8s/blob/3e40d5b98fd02ed8371d693692c3a03342dc4e8c/test/e2e/test/elasticsearch/steps_mutation.go#L254-L257

https://github.com/elastic/cloud-on-k8s/blob/9ef2aca811878c390c332be308fd13dd8119283d/test/e2e/test/elasticsearch/steps_mutation.go#L233-L238

Let's set the limit to 15 to get a little margin while still staying relatively low.
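To make the counting explicit, here is a simplified sketch of the counter being discussed (hypothetical types, not the actual steps_mutation.go code): a red health observation and an error while creating the Elasticsearch client both go through the same recording path, so both increment the FailureCount that is later compared against the tolerance.

package e2e

import (
    "sync"
    "time"
)

type healthFailure struct {
    at  time.Time
    err error
}

// continuousHealthChecks mirrors the counter discussed above.
type continuousHealthChecks struct {
    mu           sync.Mutex
    FailureCount int
    Failures     []healthFailure
}

// recordFailure is called both when building the Elasticsearch client fails
// and when the health API reports red, so both cases count against the limit.
func (c *continuousHealthChecks) recordFailure(err error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.FailureCount++
    c.Failures = append(c.Failures, healthFailure{at: time.Now(), err: err})
}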

thbkrkr commented 1 year ago

Despite #7012, we had another failure recently.

https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/324

=== RUN   TestAgentVersionUpgradeToLatest8x/Elasticsearch_cluster_health_should_not_have_been_red_during_mutation_process
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:43.223407119 +0000 UTC m=+2275.455997811: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:46.209786122 +0000 UTC m=+2278.442376802: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:49.222156744 +0000 UTC m=+2281.454747428: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:52.208771689 +0000 UTC m=+2284.441362371: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:55.21755595 +0000 UTC m=+2287.450146644: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:35:58.22815551 +0000 UTC m=+2290.460746201: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:01.224268657 +0000 UTC m=+2293.456859353: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:04.215287808 +0000 UTC m=+2296.447878500: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:07.217946796 +0000 UTC m=+2299.450537482: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:10.214928657 +0000 UTC m=+2302.447519343: cluster health red
    steps_mutation.go:160: Elasticsearch cluster health check failure at 2023-10-06 23:36:13.228691444 +0000 UTC m=+2305.461282139: cluster health red

{
    "log.level": "error",
    "@timestamp": "2023-10-06T23:37:43.291Z",
    "message": "continuing with additional tests",
    "service.version": "0.0.0-SNAPSHOT+00000000",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "error": "test failure",
    "error.stack_trace": "github.com/elastic/cloud-on-k8s/v2/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:52\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:60\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1595"
}
{
    "log.level": "info",
    "@timestamp": "2023-10-06T23:37:43.296Z",
    "log.logger": "e2e",
    "message": "Running eck-diagnostics",
    "service.version": "0.0.0-SNAPSHOT+00000000",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "cluster": "eck-e2e-stack-8-7-1-mwnv-324",
    "test": "TestAgentVersionUpgradeToLatest8x",
    "step": "Elasticsearch cluster health should not have been red during mutation process"
}

I1006 23:39:56.906242   17601 request.go:697] Waited for 1.046242504s due to client-side throttling, not priority and fairness, request: GET:https://10.95.176.1:443/apis/enterprisesearch.k8s.elastic.co/v1

error while creating the Elasticsearch client: secrets "test-agent-upgrade-xxhl-es-elastic-user" not found
{
    "log.level": "error",
    "@timestamp": "2023-10-06T23:39:59.162Z",
    "log.logger": "e2e",
    "message": "while deleting elastic resources",
    "service.version": "0.0.0-SNAPSHOT+00000000",
    "service.type": "eck",
    "ecs.version": "1.4.0",
    "error": "while deleting secret test-agent-upgrade-xxhl-es-masterdata-es-config in namespace e2e-u3xme-mercury: secrets \"test-agent-upgrade-xxhl-es-masterdata-es-config\" not found",
    "error.stack_trace": "github.com/elastic/cloud-on-k8s/v2/test/e2e/test.StepList.RunSequential\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/step.go:61\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/test.RunMutations\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/test/run_mutation.go:39\ngithub.com/elastic/cloud-on-k8s/v2/test/e2e/agent.TestAgentVersionUpgradeToLatest8x\n\t/go/src/github.com/elastic/cloud-on-k8s/test/e2e/agent/upgrade_test.go:60\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1595"
}

Code of the step where we stop the continuousHealthChecks, log all the failures, and assert.LessOrEqual on the number of failures: https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/steps_mutation.go#L151-L165

The other part of the code to keep in mind is what we do when a test fails, so that the other tests can continue:

https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/step.go#L51-L63

What's strange is that, according to the log above, we only got 11 cluster health red failures.

There is also 1 "error while creating the Elasticsearch client", but I'm not sure it is counted, because it is not logged as an "Elasticsearch cluster health check failure".

https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/steps_mutation.go#L254-L257

We tolerate 15 failures:

https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/builder.go#L30 https://github.com/elastic/cloud-on-k8s/blob/7855840ea11bfe9baf2c66a394f0fddebe7cdbcf/test/e2e/test/elasticsearch/steps_mutation.go#L163

I noticed that here we use assert instead of require like in all the other tests, but I don't see how that would explain what's happening.
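For reference, the assert/require distinction in testify only changes whether the test function keeps running after a failed assertion; both mark the test as failed. An illustrative snippet with made-up numbers:

package e2e_test

import (
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestToleranceSemantics(t *testing.T) {
    failureCount, tolerance := 11, 15

    // assert marks the test failed but the function keeps running.
    assert.LessOrEqual(t, failureCount, tolerance)
    // require would abort the test right here on failure.
    require.LessOrEqual(t, failureCount, tolerance)
}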

thbkrkr commented 11 months ago

We fail the test as soon as there is a single failure, because each one is reported with t.Errorf, despite our tolerance mechanism. 🤦

https://github.com/elastic/cloud-on-k8s/blob/58b5a7302240500f0fd3d0ab13dde2d1385470b6/test/e2e/test/elasticsearch/steps_mutation.go#L159-L161

https://github.com/elastic/cloud-on-k8s/pull/7358 should fix this.
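A simplified sketch of the idea, not the actual diff of #7358: reporting each failure with t.Errorf fails the test on the first entry, whereas logging with t.Logf and asserting only on the aggregated count actually honours the tolerance.

package e2e_test

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

func checkHealthFailuresBuggy(t *testing.T, failures []string, tolerance int) {
    for _, f := range failures {
        // t.Errorf marks the test as failed on the very first entry...
        t.Errorf("Elasticsearch cluster health check failure at %s", f)
    }
    // ...so this tolerance check is effectively dead code.
    assert.LessOrEqual(t, len(failures), tolerance)
}

func checkHealthFailuresFixed(t *testing.T, failures []string, tolerance int) {
    for _, f := range failures {
        // informational only, does not fail the test
        t.Logf("Elasticsearch cluster health check failure at %s", f)
    }
    // the only thing that can fail the test is exceeding the tolerance
    assert.LessOrEqual(t, len(failures), tolerance)
}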

thbkrkr commented 11 months ago

But it does not fix the flaky test itself:

https://buildkite.com/elastic/cloud-on-k8s-operator-nightly/builds/381

steps_mutation.go:159: ContinuousHealthChecks failures count (52) is above the tolerance (15): 52 x [cluster health red]

52 seems very high.