galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Helm delete leaves some remnants behind #403

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

Doing:

$ helm list
NAME            NAMESPACE   REVISION    UPDATED                                 STATUS      CHART                           APP VERSION
galaxy-dev      default     54          2022-12-16 15:55:08.025324 +0000 UTC    deployed    galaxy-5.3.1                    22.05
netdata         default     1           2022-11-16 15:52:08.753301 +0000 UTC    deployed    netdata-3.7.33                  v1.36.1
nfs-provisioner default     3           2022-12-15 14:08:44.824752 +0000 UTC    deployed    nfs-server-provisioner-1.4.0    3.0.0
$ helm delete galaxy-dev

leaves behind some constructs:

kubectl get pods
NAME                                       READY   STATUS    RESTARTS        AGE
galaxy-dev-rabbitmq-server-server-0        1/1     Running   0               16d

kubectl get statefulset
NAME                                     READY   AGE
galaxy-dev-rabbitmq-server-server        1/1     16d
galaxy-galaxy-dev-postgres               0/1     2d19h

kubectl get svc
NAME                                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                     AGE
galaxy-dev-rabbitmq-server               ClusterIP   10.43.96.40    <none>        5672/TCP,15672/TCP,15692/TCP                                                                                17d
galaxy-dev-rabbitmq-server-nodes         ClusterIP   None           <none>        4369/TCP,25672/TCP                                                                                          17d
galaxy-galaxy-dev-postgres               ClusterIP   10.43.44.203   <none>        5432/TCP                                                                                                    32d
galaxy-galaxy-dev-postgres-config        ClusterIP   None           <none>        <none>                                                                                                      32d
galaxy-galaxy-dev-postgres-repl          ClusterIP   10.43.13.175   <none>        5432/TCP                                                                                                    32d

kubectl get RabbitmqCluster
NAME                         ALLREPLICASREADY   RECONCILESUCCESS   AGE
galaxy-dev-rabbitmq-server   True               True               17d

kubectl get cm
NAME                                      DATA   AGE
galaxy-dev-rabbitmq-server-plugins-conf   1      17d
galaxy-dev-rabbitmq-server-server-conf    2      17d

that I could find.

I suspect issuing:

kubectl delete RabbitmqCluster/galaxy-dev-rabbitmq-server

should get rid of the rabbitmq components, but I can't find a single call that brings down the postgres components (none of the postgres-specific API objects seem to be present, only the statefulset and pods).

pcm32 commented 1 year ago

For the RabbitmqCluster deletion to finish, I had to remove the finalizer through a kubectl edit. I understand that this might have the unintended consequence of leaving stale objects in etcd.

For postgres, running:

kubectl delete statefulset/galaxy-galaxy-dev-postgres

seems to have worked. Will PR a few changes to the README.
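The manual cleanup described above could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: the function name `cleanup_release` and the `KUBECTL` override are hypothetical, the object names are inferred from the listings in this thread, and clearing finalizers skips whatever cleanup the operator would normally perform.

```shell
# Sketch only: force-remove the leftover RabbitmqCluster and postgres StatefulSet.
# Set KUBECTL=echo to preview the commands instead of running them.
cleanup_release() {
  release="$1"
  kubectl_cmd="${KUBECTL:-kubectl}"

  # Request deletion without blocking on the finalizer.
  $kubectl_cmd delete RabbitmqCluster "${release}-rabbitmq-server" \
    --ignore-not-found --wait=false

  # Clear the finalizers so the pending deletion can complete.
  # Assumption: nothing is left running that still needs them.
  $kubectl_cmd patch RabbitmqCluster "${release}-rabbitmq-server" \
    --type=merge -p '{"metadata":{"finalizers":null}}'

  # Name pattern inferred from "galaxy-galaxy-dev-postgres" in this thread.
  $kubectl_cmd delete statefulset "galaxy-${release}-postgres" --ignore-not-found
}

# Usage (preview only): KUBECTL=echo; cleanup_release galaxy-dev
```

If the RabbitmqCluster is already gone, the patch step reports a NotFound error, which should be harmless here.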

pcm32 commented 1 year ago

I can confirm that a new chart can be started with the same name after those deletions.

pcm32 commented 1 year ago

...mmm... I also discovered a rogue secret left there hanging.
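To avoid finding leftovers one kind at a time, a name-based sweep may help. The sketch below is an assumption-laden convenience, not something the chart provides: `list_leftovers` and the `KUBECTL` override are hypothetical names, and matching on the release name is a heuristic. Note that `kubectl get all` does not cover custom resources such as RabbitmqCluster, so those still need a separate check.

```shell
# Sketch only: list objects in the current namespace whose name contains
# the release name. Matching by name is a heuristic, not a guarantee.
list_leftovers() {
  release="$1"
  kubectl_cmd="${KUBECTL:-kubectl}"
  $kubectl_cmd get all,configmap,secret,endpoints,pvc -o name | grep -- "$release"
}

# Usage: list_leftovers galaxy-dev
```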

After all of this, I still cannot start a new deployment with the same name. I get an endless wait on the database to start:

NAME                                                              READY   STATUS     RESTARTS        AGE
galaxy-dev-celery-beat-bbdb788df-lzh4k                            0/1     Init:0/1   0               2m53s
galaxy-dev-celery-fd5b894f8-rmffb                                 0/1     Init:0/1   0               2m53s
galaxy-dev-init-db-bfrrx-kfq2z                                    0/1     Init:0/1   0               2m53s
galaxy-dev-job-0-df446dc6f-t7wsj                                  0/1     Init:0/1   0               2m53s
galaxy-dev-nginx-75fc94497f-45nvp                                 1/1     Running    0               2m53s
galaxy-dev-postgres-77d867c998-dcxfm                              1/1     Running    0               2m53s
galaxy-dev-rabbitmq-865b44f65f-cr87f                              1/1     Running    0               2m53s
galaxy-dev-rabbitmq-messaging-topology-operator-7b67965f94gpj86   1/1     Running    0               2m53s
galaxy-dev-rabbitmq-server-server-0                               1/1     Running    0               2m12s
galaxy-dev-tusd-6bf6456765-qm6zh                                  1/1     Running    0               2m53s
galaxy-dev-web-64646d6d6f-rvd5t                                   0/1     Init:0/1   0               2m53s
galaxy-dev-workflow-57b8d8f6f7-lppzz                              0/1     Init:0/1   0               2m53s
galaxy-galaxy-dev-postgres-0                                      0/1     Running    0               2m13s

(I have seen this go on for more than 30 minutes.) Something seems to keep postgres from electing a leader:

kubectl logs -f galaxy-galaxy-dev-postgres-0
2022-12-19 17:51:50,379 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-12-19 17:51:50,996 - bootstrapping - INFO - Looks like your running openstack
2022-12-19 17:51:51,037 - bootstrapping - INFO - Configuring pgqd
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring pgbouncer
2022-12-19 17:51:51,038 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring wal-e
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring bootstrap
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring certificate
2022-12-19 17:51:51,038 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-12-19 17:51:51,116 - bootstrapping - INFO - Configuring log
2022-12-19 17:51:51,117 - bootstrapping - INFO - Configuring patroni
2022-12-19 17:51:51,131 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-12-19 17:51:51,132 - bootstrapping - INFO - Configuring crontab
2022-12-19 17:51:51,132 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-12-19 17:51:51,132 - bootstrapping - INFO - Configuring pam-oauth2
2022-12-19 17:51:51,133 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-12-19 17:51:51,133 - bootstrapping - INFO - Configuring standby-cluster
2022-12-19 17:51:52,420 INFO: Selected new K8s API server endpoint https://192.168.42.234:6443
2022-12-19 17:51:52,442 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-12-19 17:51:52,448 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:51:52,485 INFO: waiting for leader to bootstrap
2022-12-19 17:52:02,979 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:52:02,980 INFO: waiting for leader to bootstrap
2022-12-19 17:52:12,960 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:52:12,961 INFO: waiting for leader to bootstrap

It keeps looping on the last two lines.
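One thing that might be worth checking (this is an assumption, not something confirmed in this thread) is whether cluster state from the previous deployment survived. Patroni on Kubernetes keeps its state in Endpoints or ConfigMap objects named after the cluster, and a stale one could plausibly explain a pod waiting for a leader that never appears. The sketch below only inspects those objects; `show_patroni_state` and the `KUBECTL` override are hypothetical names, and the suffixes are the ones Patroni conventionally uses.

```shell
# Sketch only: print any Endpoints/ConfigMap objects that Patroni may use to
# store cluster state. Read-only; nothing is modified or deleted.
show_patroni_state() {
  cluster="$1"
  kubectl_cmd="${KUBECTL:-kubectl}"
  # Assumption: state lives in <cluster>, <cluster>-config or <cluster>-leader.
  for suffix in "" "-config" "-leader"; do
    $kubectl_cmd get endpoints,configmap "${cluster}${suffix}" \
      --ignore-not-found -o yaml
  done
}

# Usage: show_patroni_state galaxy-galaxy-dev-postgres
```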

nuwang commented 1 year ago

Part of the issue here is that some dependencies, especially operators, should not be removed before the CRs that depend on them are removed. Otherwise, the operator is gone before it has a chance to clean up the CR, leaving artefacts behind. We've discussed this in the past, and one solution that came up was to separate the installation of operators, cvmfs-csi and other dependencies into a pre-installation step. In the end, the consensus was to keep things as they are, but to document that in production scenarios one should install the operators manually and set postgresql.deploy=False, rabbitmq.deploy=False, etc. on the chart. The documentation still needs to be updated.
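In command form, the production setup suggested above might look like the sketch below. Only the two `deploy` values come from this thread; the function name, the `HELM` override and the chart reference are placeholders, and any values needed to point Galaxy at externally managed postgres and rabbitmq instances are not covered here.

```shell
# Sketch only: install the chart without its bundled postgres and rabbitmq.
# Assumes the operators or services were installed separately beforehand.
install_without_bundled_services() {
  release="$1"
  chart="$2"   # placeholder: chart reference or path, not given in this thread
  helm_cmd="${HELM:-helm}"
  $helm_cmd install "$release" "$chart" \
    --set postgresql.deploy=false \
    --set rabbitmq.deploy=false
}

# Usage (preview only): HELM=echo; install_without_bundled_services galaxy-dev ./galaxy
```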

afgane commented 1 year ago

As a follow-up here, last week we updated the GalaxyKubeMan chart to explicitly remove some resources in an orderly fashion and enable clean deletion of a deployment: https://github.com/galaxyproject/galaxykubeman-helm/blob/anvil/galaxykubeman/templates/job-predelete.yaml
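For charts without such a hook, the same ordering idea can be applied by hand: delete the custom resource while its operator is still running, then delete the release. The sketch below illustrates that ordering only and is not the GalaxyKubeMan job itself; `ordered_teardown`, the `KUBECTL` and `HELM` overrides and the 120s timeout are assumptions.

```shell
# Sketch only: remove the RabbitmqCluster CR first, then the Helm release.
ordered_teardown() {
  release="$1"
  kubectl_cmd="${KUBECTL:-kubectl}"
  helm_cmd="${HELM:-helm}"

  # 1. Delete the CR while the operator can still process its finalizer.
  #    Stop here if this fails or times out, so the operator is not removed.
  $kubectl_cmd delete RabbitmqCluster "${release}-rabbitmq-server" \
    --ignore-not-found --wait=true --timeout=120s || return 1

  # 2. Only then remove the release, which also removes the operator.
  $helm_cmd delete "$release"
}

# Usage (preview only): KUBECTL=echo; HELM=echo; ordered_teardown galaxy-dev
```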