Orange-OpenSource / casskop

This Kubernetes operator automates Cassandra operations such as deploying a new rack-aware cluster, adding/removing nodes, configuring C* and JVM parameters, upgrading JVM and C* versions, and more.
https://orange-opensource.github.io/casskop/
Apache License 2.0

Update pod status to CR after setting podLastOperation to StatusFinalizing #377

Closed · srteam2020 closed this 2 years ago

srteam2020 commented 2 years ago
| Q | A |
|---|---|
| Bug fix? | [YES] |
| New feature? | [] |
| API breaks? | [] |
| Deprecations? | [] |
| Related tickets | fixes #370 |
| License | Apache 2.0 |

What's in this PR?

Immediately update the pod status in the CR after setting podLastOperation to StatusFinalizing, so we avoid the case where the decommissioned pod has already been deleted but podLastOperation is still StatusOngoing. A minimal sketch of the idea is shown below.
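
To make the intent concrete, here is a minimal sketch of the change, assuming a controller-runtime client and casskop-style CassandraCluster types; the helper name `markPodFinalizing`, the import path, and the exact field paths are illustrative assumptions, not the literal diff:

```go
package controller

import (
	"context"

	api "github.com/Orange-OpenSource/casskop/pkg/apis/db/v1alpha1" // assumed import path for the CR types
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markPodFinalizing is a hypothetical helper illustrating the fix: set the
// pod operation to StatusFinalizing and write the status back to the
// CassandraCluster CR immediately, instead of waiting for the end of the
// reconcile loop.
func markPodFinalizing(ctx context.Context, c client.Client,
	cc *api.CassandraCluster, dcRackName string) error {

	// Exact field path is an assumption based on the rack status layout.
	cc.Status.CassandraRackStatus[dcRackName].PodLastOperation.Status = api.StatusFinalizing

	// Persisting right away closes the window where the decommissioned pod is
	// already gone while the stored podLastOperation still says StatusOngoing.
	return c.Status().Update(ctx, cc)
}
```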

Why?

See #370

srteam2020 commented 2 years ago

CI fails due to:

INFO: Creating docker container
INFO: Copy goss files into container
INFO: Starting docker container
INFO: Container ID: dc0d38be
INFO: Found goss_wait.yaml, waiting for it to pass before running tests
oci runtime error: exec failed: cannot exec a container that has stopped

OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
intx ThreadPriorityPolicy=42 is outside the allowed range [ 0 ... 1 ]
Improperly specified VM option 'ThreadPriorityPolicy=42'
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
ERROR: goss_wait.yaml never passed
INFO: Deleting container
== deleteDgossVolumes
dgoss-bootstrap-vol
dgoss-extralib-vol
dgoss-configmap-vol
dgoss-tools-vol
make: *** [Makefile:434: dgoss-bootstrap] Error 1

I am not sure whether this is related to our PR.

cscetbon commented 2 years ago

@srteam2020 looking at your branch I see you are way behind. Please rebase your changes off of master and the error should disappear.

fdehay commented 2 years ago

Hi @srteam2020, could you please rebase on master one more time? We pushed a fix today for the integration tests that we need before accepting PRs. Sorry for this. Franck

srteam2020 commented 2 years ago

@cscetbon @fdehay I have rebased on master and all the checks have passed. Could you take a look at the PR again?

fdehay commented 2 years ago

Hello @srteam2020, sorry for the delay: we discussed your PR "internally" but I forgot to update you here. We are a little uncomfortable with forcing the status update when it should happen later in the reconciliation process. On the other hand, a decommission is not something our ops team does often; they are more into creating clusters :) As I said, we have a very light team on this at the moment, so I don't want to introduce potential bugs. Not saying no, just explaining our position and our silence. How critical is it for you? I understand you can reproduce the problem with a tool of yours?

fdehay commented 2 years ago

I forgot to say thanks for the rebase and the successful tests :)

srteam2020 commented 2 years ago

Hello @fdehay

> How critical is it for you? I understand you can reproduce the problem with a tool of yours?

This issue is quite critical: it causes not only a resource leak, but the controller itself also gets stuck in the dirty state and cannot make any progress. Even if the user later wants to scale the Cassandra cluster up or down again, the controller will refuse to do so, because it has already returned with the "pod not found" error before updating the statefulset (see the sketch below). And yes, we have a tool to reproduce it.
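
Roughly, the stuck state looks like this in the reconcile path (a simplified sketch, not casskop's actual control flow; `handleDecommission` and the surrounding names are illustrative assumptions):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// handleDecommission is a hypothetical, simplified view of the failure mode:
// the decommissioned pod has already been deleted, so the lookup fails and
// the reconcile returns an error before the CR status (still StatusOngoing)
// or the statefulset is ever updated.
func handleDecommission(ctx context.Context, c client.Client,
	podName types.NamespacedName) (reconcile.Result, error) {

	var pod corev1.Pod
	if err := c.Get(ctx, podName, &pod); err != nil {
		// Returning here on every reconcile means podLastOperation never
		// leaves StatusOngoing, so later scale up/down requests are refused.
		return reconcile.Result{}, err
	}

	// ... normal decommission handling would continue here ...
	return reconcile.Result{}, nil
}
```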

cscetbon commented 2 years ago

@fdehay do we move forward on this PR?