elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

fix: cleanup pre-stop-hook-script.sh because NODE_ID can be empty #7892

Open BobVanB opened 3 weeks ago

BobVanB commented 3 weeks ago

The main problem

When upgrading to an image with some other plugin, the operator terminates each pod and tries to remove it from the ES cluster.

This piece of code can yield an empty NODE_ID:

NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')
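
For illustration (a sketch only, not necessarily the exact change made in this PR), a guard that skips the shutdown call when the extraction yields nothing, assuming the script's existing JSON logging helper is called 'log':

# Sketch only: skip the shutdown call instead of continuing with an empty id.
# Assumes the script's logging helper is named 'log'.
NODE_ID=$(grep "$POD_NAME" "$resp_body" | cut -f 1 -d ' ')
if [ -z "$NODE_ID" ]; then
    log "no node id found for $POD_NAME in the _cat/nodes response, skipping node shutdown"
    exit 0
fi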

Result

What I still want to know

Is the node removed before calling _cat/nodes?

When the node is terminated and pre-stop-hook-script.sh is called, is it possible that the node has already been removed from the _cat/nodes output? Or is it possible that the query ends up on the terminated node and doesn't return a result?

This piece of code returns the list of nodes, and I wonder whether, once the pod is terminating, the node is already absent from this list of active nodes. I have no solid basis for this claim yet: I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the terminating node. (A manual check is sketched below the request.)

request -X GET "${ES_URL}/_cat/nodes?full_id=true&h=id,name"
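
One way to test this is a manual check from a pod that is not terminating while the rollout is in progress. This is a hypothetical sketch: the credentials, CA path and pod name are placeholders, not values taken from the hook itself.

# Hypothetical manual check; ELASTIC_PASSWORD, the CA path and the pod name
# are placeholders.
curl -s -u "elastic:$ELASTIC_PASSWORD" --cacert /path/to/http-ca.crt \
  "https://localhost:9200/_cat/nodes?full_id=true&h=id,name" | grep "name-of-terminating-pod"
# An empty result while that pod is terminating would support the theory that
# NODE_ID is empty because the node is already gone from _cat/nodes.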

Why is terminationGracePeriodSeconds much lower than the possible script run time?

The default terminationGracePeriodSeconds is 180 seconds. The script also has 2 retry 10 calls, which use 2 ** count as the wait. This can result in a lot of wait time (see the sketch after the questions below):

round 1: 1 second
round 2: 1 second of the previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of the previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = +- 17 minutes

  1. Should terminationGracePeriodSeconds be set to 30 minutes?
  2. Should the retry 10 be much lower, something like retry 8, so the script ends with "retry 8/8 exited 1, no more retries left"?
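
To make the wait-time estimate concrete, here is a small sketch. It assumes the retry helper sleeps 2 ** attempt seconds between attempts and that a 'retry 10' call runs through all of its attempts.

# Sketch: cumulative sleep time of a single 'retry 10' call, assuming an
# exponential backoff of 2 ** attempt seconds between attempts.
total=0
for attempt in $(seq 1 9); do   # 9 waits between 10 attempts
  total=$((total + (2 ** attempt)))
  echo "after attempt $attempt: $total seconds spent waiting"
done
# total ends up at 1022 seconds, roughly 17 minutes, for one retry call;
# two back-to-back 'retry 10' calls could roughly double that.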

What has been done

PoC Result

Added some debug information to prove that the script is working. I will add that it is not fun to debug the bash script without 'set -x'; a sketch of such a logging helper follows the output below.

{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving nodes", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "resp_body: /tmp/tmp.k6cZwbNtph", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "NODE_ID: h3WUy....aTV9qjl7w", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "success to retrieve node id", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "retrieving shutdown request", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "check shutdown response", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-06-12T06:01:27+00:00", "message": "delaying termination for 50 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}

What has not been done

...

thbkrkr commented 6 days ago

buildkite test this

thbkrkr commented 6 days ago

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr commented 5 days ago

This breaks two e2e tests:

TestMutationResizeMemoryDown/Stopping_to_watch_for_correct_node_shutdown_API_usage
TestMutationResizeMemoryUp/Stopping_to_watch_for_correct_node_shutdown_API_usage

I don't know exactly what's going on yet. The failure is because during the mutation, /_nodes/shutdown returned more than one entry:

[
    {Q3SwszElnHxaJg RESTART pre-stop hook 1719495839993 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
    {eUcnfdK-Q3SwszElnHxaJg RESTART 70382 1719495839494 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]
thbkrkr commented 5 days ago

The pre-stop hook incorrectly extracted the node id, which created 2 shutdown records with different ids (Q3SwszElnHxaJg and eUcnfdK-Q3SwszElnHxaJg).
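
For illustration only (not necessarily the cause here, and not the change that was applied): a partial grep match on the _cat/nodes output is one way to end up with the wrong id, and matching the name column exactly avoids it.

# Illustration only: with h=id,name the first column is the id and the second
# the node name; matching the name exactly avoids selecting a row whose id or
# name merely contains the pod name.
NODE_ID=$(awk -v name="$POD_NAME" '$2 == name {print $1}' "$resp_body")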

thbkrkr commented 5 days ago

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

BobVanB commented 4 days ago

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr commented 4 days ago

buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT

thbkrkr commented 2 days ago

Thank you @BobVanB!