Pod in CrashLoopBackOff because of unexpected bulletin board error

RedisLabs / redis-enterprise-k8s-docs


Pod in CrashLoopBackOff because of unexpected bulletin board error #239

Closed cschockaert closed 1 year ago

cschockaert commented 2 years ago

Hello,

One of my cluster node pods is in CrashLoopBackOff because of:

time="2022-07-07T13:02:48Z" level=error msg="could not find node by name in bulletinboard: gke-development-cluste-main-n1std8-v1-3059a148-mahz"

I think it's because I changed the node selector and tolerations of the REC. The bulletin board therefore no longer includes the nodes of the previous node pool, even though the old pods are still running on that pool.

cschockaert commented 2 years ago

It seems that the cluster trigger deployment is updated to the new node pool, but the cluster StatefulSet is still on the old node pool. If one of the StatefulSet pods needs to restart, it can land on a node that is unknown to the bulletin board, which leaves it stuck in an endless crash loop.
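To make the suspected sequence concrete, here is a toy model in Python. It is not the operator's code: `build_bulletin_board`, `find_node`, the pool names, and the second node name are all hypothetical, and it simply assumes the bulletin board is built only from nodes that match the current node selector. `cloud.google.com/gke-nodepool` is the label GKE puts on nodes to identify their pool.

```python
# Toy model only: how the bulletin board is really built is not shown in this thread.
POOL_LABEL = "cloud.google.com/gke-nodepool"


def build_bulletin_board(nodes, node_selector):
    """Keep only the nodes whose labels satisfy every key/value in node_selector."""
    return {
        name: labels
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in node_selector.items())
    }


def find_node(board, name):
    """Look a node up by name, failing the way the log line in this issue does."""
    if name not in board:
        raise LookupError(f"could not find node by name in bulletinboard: {name}")
    return board[name]


nodes = {
    # Old pool: the node name comes from the log above, the pool name is assumed.
    "gke-development-cluste-main-n1std8-v1-3059a148-mahz": {POOL_LABEL: "main-n1std8-v1"},
    # New pool: entirely hypothetical names.
    "hypothetical-new-pool-node": {POOL_LABEL: "hypothetical-new-pool"},
}

# After the selector change, only the new pool ends up on the board.
board = build_bulletin_board(nodes, {POOL_LABEL: "hypothetical-new-pool"})
```

In this model, a pod that restarts on the old node calls `find_node` with a name the board never received, so every restart fails the same way.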

cschockaert commented 2 years ago

I'm not sure whether it's the cluster trigger pod or the operator that generates the bulletin board, but if we use a node selector, I think we can end up in a situation where not all GKE VM nodes are added to the bulletin board, which causes big trouble.
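A rough way to check for this situation is to compare where the pods actually run against the nodes that match the selector. The sketch below is only an assumption about how one might do that: `pods_outside_selector` is a hypothetical helper, and it expects the parsed JSON produced by `kubectl get nodes -o json` and `kubectl get pods -o json`, relying on the standard `metadata.name`, `metadata.labels`, and `spec.nodeName` fields.

```python
def pods_outside_selector(nodes_json, pods_json, node_selector):
    """Return (pod name, node name) pairs for pods scheduled on nodes
    whose labels do not satisfy node_selector."""
    matching_nodes = {
        node["metadata"]["name"]
        for node in nodes_json["items"]
        if all(
            node["metadata"].get("labels", {}).get(key) == value
            for key, value in node_selector.items()
        )
    }
    stranded = []
    for pod in pods_json["items"]:
        node_name = pod["spec"].get("nodeName")
        # Pods that are not scheduled yet have no nodeName; skip them.
        if node_name and node_name not in matching_nodes:
            stranded.append((pod["metadata"]["name"], node_name))
    return stranded
```

For example, after saving the two `kubectl` outputs to files, load them with `json.load` and pass the new selector as a dict such as `{"cloud.google.com/gke-nodepool": "<new pool name>"}`. Any pair returned is a pod still running on a node outside the selector.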

alexvasseur commented 1 year ago

We reviewed this internally, and it seems that you were changing the node pool as a one-time change in a dev/test environment. We recommend approaching this with a new setup, as a hot change of node pool is not a validated scenario. There are other approaches should you need to migrate a production system later. We can close this one @laurentdroin - kindly reach out to Redis teams as required.

cschockaert commented 1 year ago

Yep, thanks. I will never do that in production, since it's not working in dev :)