Open BobVanB opened 3 weeks ago
buildkite test this
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
This breaks two e2e tests:

- `TestMutationResizeMemoryDown/Stopping_to_watch_for_correct_node_shutdown_API_usage`
- `TestMutationResizeMemoryUp/Stopping_to_watch_for_correct_node_shutdown_API_usage`
I don't know exactly what's going on yet.
The failure is because, during the mutation, `/_nodes/shutdown` returned more than one entry:
```
[
  {Q3SwszElnHxaJg RESTART pre-stop hook 1719495839993 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
  {eUcnfdK-Q3SwszElnHxaJg RESTART 70382 1719495839494 COMPLETE {COMPLETE 0 no shard relocation is necessary for a node restart} {COMPLETE} {COMPLETE}}
]
```
The pre-stop hook incorrectly extracted the node ID, which created 2 shutdown records with different IDs (`Q3SwszElnHxaJg` and `eUcnfdK-Q3SwszElnHxaJg`).
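One way this kind of truncation can happen (a hypothetical illustration, not the actual extraction code in the pre-stop hook): Elasticsearch node IDs may contain `-` and `_`, so a character class that omits those splits `eUcnfdK-Q3SwszElnHxaJg` at the hyphen and keeps only the second half:

```shell
#!/bin/sh
NODE_ID="eUcnfdK-Q3SwszElnHxaJg"

# Buggy extraction: the character class misses '-' and '_', so only the
# longest alphanumeric run (the part after the hyphen) is returned.
printf '%s\n' "$NODE_ID" | grep -oE '[A-Za-z0-9]{10,}'
# -> Q3SwszElnHxaJg

# Fixed extraction: allow '-' and '_', which are valid in node IDs.
printf '%s\n' "$NODE_ID" | grep -oE '[A-Za-z0-9_-]{10,}'
# -> eUcnfdK-Q3SwszElnHxaJg
```

With the buggy pattern, the hook registers a shutdown for the truncated ID, while the orchestrator registers one for the real ID — matching the two records above.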
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
buildkite test this -f p=gke,E2E_TAGS=es -m s=7.17.8,s=8.14.1,s=8.15.0-SNAPSHOT
Thank you @BobVanB!
The main problem
When upgrading an image with some other plugin, the operator will terminate each pod and try to remove it from the ES cluster.
This piece of code (the extracted NODE_ID) can be empty. Result: the `_nodes`/`shutdown` call fails with `error_exit "failed to call node shutdown API"`
and a shutdown is never called, resulting in the same pod being recreated and everything starting all over from the top.

What I still want to know
Is the node removed before calling `_cat/nodes`?
When the node is terminated and the pre-stop-hook-script.sh is called, is it possible that the node is already removed from the `_cat/nodes` query? Or is it possible that the query ends up on the terminated node and doesn't give a result? This piece of code returns the list of nodes, and I wonder whether, once the pod is terminated, the node is already absent from the list of active nodes. There is still no basis for this claim; I have not confirmed whether NODE_ID is empty because the other nodes in the cluster no longer see the terminated node.
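One way to check this theory by hand: `GET _cat/nodes?full_id=true&h=id,name` lists the full ID and name of every node the queried node currently sees, so an empty result for the terminating pod's name would confirm it. A sketch of the lookup (the API parameters are real; the simulated response and pod name are illustrative):

```shell
#!/bin/sh
# Simulated output of `GET _cat/nodes?full_id=true&h=id,name`
# (in the real hook this would come from the cluster via curl).
CAT_NODES='Q3SwszElnHxaJg  eck-es-default-0
eUcnfdK-Q3SwszElnHxaJg  eck-es-default-1'

# Look up the ID for this pod's name; an empty result would mean the
# rest of the cluster no longer sees the terminating node.
POD_NAME="eck-es-default-1"
NODE_ID=$(printf '%s\n' "$CAT_NODES" | awk -v name="$POD_NAME" '$2 == name {print $1}')
echo "NODE_ID=$NODE_ID"
```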
Why is terminationGracePeriodSeconds way less than the possible script run time?
The default terminationGracePeriodSeconds is 180 seconds. The script also has 2 `retry 10` calls, which wait `2 ** count` seconds between attempts. This can result in a lot of wait time:

round 1: 1 second
round 2: 1 second of previous round + 1 + 2 = 4 seconds
round 3: 4 seconds of previous rounds + 1 + 2 + 4 = 11 seconds
...
round 9: 502 seconds of the previous rounds + 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 seconds = ± 17 minutes

`retry 10` could be way less, something like `retry 8`, and get "retry 8/8 exited 1, no more retries left".

What has been done
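As background for the timing discussed in the previous section, a minimal `retry` helper with exponential backoff could look like this (a hypothetical sketch, not the actual pre-stop script; with `retry 10` the worst-case sleep alone is 2 + 4 + … + 512 = 1022 seconds, roughly 17 minutes, far beyond the 180-second grace period):

```shell
#!/bin/sh
# Hypothetical sketch of a retry helper with exponential backoff.
retry() {
  retries=$1
  shift
  count=0
  until "$@"; do
    count=$((count + 1))
    if [ "$count" -ge "$retries" ]; then
      echo "retry $count/$retries exited 1, no more retries left" >&2
      return 1
    fi
    sleep $((1 << count))  # 2, 4, 8, ... seconds between attempts
  done
}
```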
`retry 3` or just `error_exit "failed to retrieve node ID"`.
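Such a guard could be as simple as the following sketch (the stub only imitates the script's `error_exit` helper; the function name `require_node_id` is illustrative):

```shell
#!/bin/sh
# Stub standing in for the script's real error_exit helper, which logs
# the message and exits non-zero.
error_exit() {
  echo "$1" >&2
  exit 1
}

# Fail fast instead of calling the shutdown API with an empty node ID.
require_node_id() {
  [ -n "$1" ] || error_exit "failed to retrieve node ID"
}

NODE_ID="eUcnfdK-Q3SwszElnHxaJg"
require_node_id "$NODE_ID"
echo "registering shutdown for node $NODE_ID"
```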
After cleanup, looks like this was not needed.

PoC Result
Added some debug information to prove that the script is working. I will add that it is not fun to debug the bash script without `set -x`.
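For anyone reproducing this, bash's trace mode can be turned on just for the hook; these are standard bash features, nothing specific to this script:

```shell
#!/usr/bin/env bash
# Trace every command to stderr; PS4 prefixes each traced line with
# the script name and line number for easier reading.
set -x
PS4='+ ${BASH_SOURCE##*/}:${LINENO}: '

NODE_ID="eUcnfdK-Q3SwszElnHxaJg"
echo "node: $NODE_ID"
```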
What has not been done
...