pcm32 opened 1 year ago
For the rabbitmq cluster deletion to finish, I had to remove the finalizer through a `kubectl edit`. I understand that this might have the unintended consequence of leaving stale objects in etcd.
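For reference, the same finalizer removal can be done non-interactively with a patch. This is only a sketch: the resource kind is the `RabbitmqCluster` CR created by the RabbitMQ cluster operator, and the name `galaxy-dev-rabbitmq` is assumed from the pod names below, so check what actually exists first.

```shell
# Hedged example: list the leftover custom resources, then strip their finalizers.
# Removing a finalizer bypasses the operator's cleanup, which is exactly why
# stale objects can be left behind.
kubectl get rabbitmqclusters.rabbitmq.com
kubectl patch rabbitmqclusters.rabbitmq.com galaxy-dev-rabbitmq \
  --type merge -p '{"metadata":{"finalizers":null}}'
```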
For postgres, running:

```shell
kubectl delete statefulset/galaxy-galaxy-dev-postgres
```

seems to have worked. Will PR a few changes to the README.
I can confirm that a new chart can be started with the same name after those deletions.
...mmm... I also discovered a rogue secret left hanging there.
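One way to spot stragglers like that secret is to list everything that still carries the old release name. A hedged sketch (the `galaxy-dev` prefix is assumed from the pod names below):

```shell
# Hedged example: grep for objects left over from the previous release by name prefix.
kubectl get secrets,configmaps,endpoints,pvc -o name | grep galaxy-dev
```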
After all of this, I still cannot start a new deployment with the same name. I get an endless wait for the database to start:
```
NAME                                                              READY   STATUS     RESTARTS   AGE
galaxy-dev-celery-beat-bbdb788df-lzh4k                            0/1     Init:0/1   0          2m53s
galaxy-dev-celery-fd5b894f8-rmffb                                 0/1     Init:0/1   0          2m53s
galaxy-dev-init-db-bfrrx-kfq2z                                    0/1     Init:0/1   0          2m53s
galaxy-dev-job-0-df446dc6f-t7wsj                                  0/1     Init:0/1   0          2m53s
galaxy-dev-nginx-75fc94497f-45nvp                                 1/1     Running    0          2m53s
galaxy-dev-postgres-77d867c998-dcxfm                              1/1     Running    0          2m53s
galaxy-dev-rabbitmq-865b44f65f-cr87f                              1/1     Running    0          2m53s
galaxy-dev-rabbitmq-messaging-topology-operator-7b67965f94gpj86   1/1     Running    0          2m53s
galaxy-dev-rabbitmq-server-server-0                               1/1     Running    0          2m12s
galaxy-dev-tusd-6bf6456765-qm6zh                                  1/1     Running    0          2m53s
galaxy-dev-web-64646d6d6f-rvd5t                                   0/1     Init:0/1   0          2m53s
galaxy-dev-workflow-57b8d8f6f7-lppzz                              0/1     Init:0/1   0          2m53s
galaxy-galaxy-dev-postgres-0                                      0/1     Running    0          2m13s
```
(I have seen this go on for more than 30 minutes.) Something seems to keep postgres from electing a leader:
```
kubectl logs -f galaxy-galaxy-dev-postgres-0
2022-12-19 17:51:50,379 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-12-19 17:51:50,996 - bootstrapping - INFO - Looks like your running openstack
2022-12-19 17:51:51,037 - bootstrapping - INFO - Configuring pgqd
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring pgbouncer
2022-12-19 17:51:51,038 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring wal-e
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring bootstrap
2022-12-19 17:51:51,038 - bootstrapping - INFO - Configuring certificate
2022-12-19 17:51:51,038 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-12-19 17:51:51,116 - bootstrapping - INFO - Configuring log
2022-12-19 17:51:51,117 - bootstrapping - INFO - Configuring patroni
2022-12-19 17:51:51,131 - bootstrapping - INFO - Writing to file /run/postgres.yml
2022-12-19 17:51:51,132 - bootstrapping - INFO - Configuring crontab
2022-12-19 17:51:51,132 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-12-19 17:51:51,132 - bootstrapping - INFO - Configuring pam-oauth2
2022-12-19 17:51:51,133 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-12-19 17:51:51,133 - bootstrapping - INFO - Configuring standby-cluster
2022-12-19 17:51:52,420 INFO: Selected new K8s API server endpoint https://192.168.42.234:6443
2022-12-19 17:51:52,442 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-12-19 17:51:52,448 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:51:52,485 INFO: waiting for leader to bootstrap
2022-12-19 17:52:02,979 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:52:02,980 INFO: waiting for leader to bootstrap
2022-12-19 17:52:12,960 INFO: Lock owner: None; I am galaxy-galaxy-dev-postgres-0
2022-12-19 17:52:12,961 INFO: waiting for leader to bootstrap
```
It keeps looping on the last two lines.
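One thing that may be worth checking after a re-install under the same name (this is a guess, not something confirmed in this thread) is whether Patroni's cluster state from the previous deployment is still around. Patroni on Kubernetes keeps its leader/config state in Endpoints or ConfigMaps named after the cluster, and a stale one can leave a fresh pod waiting for a leader that will never appear. A hedged sketch of how to look for them:

```shell
# Hedged example: look for leftover Patroni/Spilo state objects from the old
# cluster; the name prefix is assumed from the pod name above.
kubectl get endpoints,configmaps -o name | grep galaxy-galaxy-dev-postgres
```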
Part of the issue here is that some dependencies, especially operators, should not be removed before their dependent CRs are removed. Otherwise, the operator is gone before it has a chance to clean up the CR, leaving artefacts behind. We've discussed this in the past, and one solution that came up was to separate the installation of the operators, cvmfs-csi, and other dependencies into a pre-installation step. In the end, the consensus was to keep things as they are, but document that in production scenarios one should manually install the operators and set postgresql.deploy=False, rabbitmq.deploy=False, etc. on the chart (roughly as sketched below). The documentation still needs to be updated.
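For illustration, a production-style install along those lines might look roughly like this. It is a sketch only: the release name, namespace, and chart reference are placeholders, and the value names are taken from the comment above.

```shell
# Hedged example: install the operators / CSI drivers yourself first, then tell
# the Galaxy chart not to deploy its own postgres and rabbitmq.
helm install galaxy-dev <galaxy-chart-reference> \
  --namespace galaxy-dev --create-namespace \
  --set postgresql.deploy=false \
  --set rabbitmq.deploy=false
```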
As a follow-up here, last week we updated the GalaxyKubeMan chart to explicitly remove some resources in an orderly fashion and enable clean deletion of a deployment: https://github.com/galaxyproject/galaxykubeman-helm/blob/anvil/galaxykubeman/templates/job-predelete.yaml
Doing:
leaves behind some constructs:
that I could find.
I suspect issuing:
should get rid of the rabbitmq components, but I can't seem to find a unified call to bring down the postgres components (none of the postgres-specific API objects seem to be present, besides the statefulset and pods).
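Absent a single unified call, one possibility (hedged, since the exact labels and object names depend on how the postgres operator labels its children) is to sweep up the remaining postgres pieces by name and delete them explicitly:

```shell
# Hedged example: find anything still referencing the old postgres cluster by name...
kubectl get statefulsets,services,endpoints,configmaps,secrets,pvc -o name \
  | grep galaxy-galaxy-dev-postgres
# ...then delete what that listing shows, e.g. (names here are placeholders):
# kubectl delete statefulset/galaxy-galaxy-dev-postgres pvc/<leftover-pvc>
```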