Open BobVanB opened 3 weeks ago
buildkite test this
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
This breaks two e2e tests:

- `TestMutationResizeMemoryDown/Stopping_to_watch_for_correct_node_shutdown_API_usage`
- `TestMutationResizeMemoryUp/Stopping_to_watch_for_correct_node_shutdown_API_usage`
I don't know exactly what's going on yet.
The failure is because, during the mutation, `/_nodes/shutdown` returned more than one entry:
```
[
  {Q3SwszElnHxaJg RESTART pre-stop hook 1719495839993 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
  {eUcnfdK-Q3SwszElnHxaJg RESTART 70382 1719495839494 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]
```
The pre-stop hook incorrectly extracted the node ID, which created 2 shutdown records with different IDs (`Q3SwszElnHxaJg` and `eUcnfdK-Q3SwszElnHxaJg`).
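One way this kind of truncation can happen (a hypothetical illustration, not the actual extraction code in the pre-stop hook): Elasticsearch node IDs may contain `-` and `_`, so a character class that omits those splits `eUcnfdK-Q3SwszElnHxaJg` at the hyphen and keeps only the second half:

```shell
#!/bin/sh
NODE_ID="eUcnfdK-Q3SwszElnHxaJg"

# Buggy extraction: the character class misses '-' and '_', so only the
# longest alphanumeric run (the part after the hyphen) is returned.
printf '%s\n' "$NODE_ID" | grep -oE '[A-Za-z0-9]{10,}'
# -> Q3SwszElnHxaJg

# Fixed extraction: allow '-' and '_', which are valid in node IDs.
printf '%s\n' "$NODE_ID" | grep -oE '[A-Za-z0-9_-]{10,}'
# -> eUcnfdK-Q3SwszElnHxaJg
```

With the buggy pattern, the hook registers a shutdown for the truncated ID, while the orchestrator registers one for the real ID — matching the two records above.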
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
Thank you @BobVanB!
The main problem
When upgrading an image with some other plugin, the operator will terminate each pod and try to remove it from the ES cluster.
This piece of code (the extracted NODE_ID) can be empty. Result: the `_nodes`/`shutdown` call fails with `error_exit "failed to call node shutdown API"`
and a shutdown is never called, resulting in the same pod being recreated and everything starting all over from the top.

What I still want to know
Is the node removed before calling `_cat/nodes`?
When the node is terminated and the pre-stop-hook-script.sh is called, is it possible that the node is already removed from the `_cat/nodes` query? Or is it possible that the query ends up on the terminated node and doesn't give a result? This piece of code returns the list of nodes, and I wonder whether, once the pod is terminated, the node is already absent from the list of active nodes. There is still no basis for this claim; I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the terminated node.
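One way to check this theory by hand: `GET _cat/nodes?full_id=true&h=id,name` lists the full ID and name of every node the queried node currently sees, so an empty result for the terminating pod's name would confirm it. A sketch of the lookup (the API parameters are real; the simulated response and pod name are illustrative):

```shell
#!/bin/sh
# Simulated output of `GET _cat/nodes?full_id=true&h=id,name`
# (in the real hook this would come from the cluster via curl).
CAT_NODES='Q3SwszElnHxaJg  eck-es-default-0
eUcnfdK-Q3SwszElnHxaJg  eck-es-default-1'

# Look up the ID for this pod's name; an empty result would mean the
# rest of the cluster no longer sees the terminating node.
POD_NAME="eck-es-default-1"
NODE_ID=$(printf '%s\n' "$CAT_NODES" | awk -v name="$POD_NAME" '$2 == name {print $1}')
echo "NODE_ID=$NODE_ID"
```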
Why is terminationGracePeriodSeconds way less than the possible script run time?
The default terminationGracePeriodSeconds is 180 seconds. The script also has 2 `retry 10` calls, which wait `2 ** count` seconds between attempts. This can result in a lot of wait time:

round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = ± 17 minutes

`retry 10` could be way less, something like `retry 8`, and get "retry 8/8 exited 1, no more retries left".

What has been done
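As background for the timing discussed in the previous section, a minimal `retry` helper with exponential backoff could look like this (a hypothetical sketch, not the actual pre-stop script; with `retry 10` the worst-case sleep alone is 2 + 4 + … + 512 = 1022 seconds, roughly 17 minutes, far beyond the 180-second grace period):

```shell
#!/bin/sh
# Hypothetical sketch of a retry helper with exponential backoff.
retry() {
  retries=$1
  shift
  count=0
  until "$@"; do
    count=$((count + 1))
    if [ "$count" -ge "$retries" ]; then
      echo "retry $count/$retries exited 1, no more retries left" >&2
      return 1
    fi
    sleep $((1 << count))  # 2, 4, 8, ... seconds between attempts
  done
}
```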
`retry 3` or just `error_exit "failed to retrieve node ID"`.
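Such a guard could be as simple as the following sketch (the stub only imitates the script's `error_exit` helper; the function name `require_node_id` is illustrative):

```shell
#!/bin/sh
# Stub standing in for the script's real error_exit helper, which logs
# the message and exits non-zero.
error_exit() {
  echo "$1" >&2
  exit 1
}

# Fail fast instead of calling the shutdown API with an empty node ID.
require_node_id() {
  [ -n "$1" ] || error_exit "failed to retrieve node ID"
}

NODE_ID="eUcnfdK-Q3SwszElnHxaJg"
require_node_id "$NODE_ID"
echo "registering shutdown for node $NODE_ID"
```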
After cleanup, looks like this was not needed.

PoC Result
Added some debug information to prove that the script is working. I will add that it is not fun to debug the bash script without `set -x`.
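For anyone reproducing this, bash's trace mode can be turned on just for the hook; these are standard bash features, nothing specific to this script:

```shell
#!/usr/bin/env bash
# Trace every command to stderr; PS4 prefixes each traced line with
# the script name and line number for easier reading.
set -x
PS4='+ ${BASH_SOURCE##*/}:${LINENO}: '

NODE_ID="eUcnfdK-Q3SwszElnHxaJg"
echo "node: $NODE_ID"
```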
What has not been done
...