Open phong2tran opened 8 months ago
It seems like you are upgrading from an ancient version of Postgres. This issue was fixed here: https://github.com/cloudfoundry/bpm-release/pull/152
Thank you so much for the response @rkoster! Indeed we're operating an "outdated" BOSH environment and have not done the upgrade regularly as we should. We have seen this issue intermittently on a few runs of BOSH Director upgrade testing.
How can we move forward with this BOSH Director v280.0.14 upgrade and ensure that this issue won't happen in our existing production BOSH environments?
Option 1: Can we first manually shut down Postgres 10 on the BOSH Director VM before attempting BOSH Director upgrade? If yes, which command sequences should be used to properly shut down Postgres 10 and other BOSH Director related services?
Option 2: First update BPM component to v1.1.14 or higher (https://github.com/cloudfoundry/bpm-release/pull/152#issuecomment-938235720) with the fix on current BOSH Director v271.2.0 before upgrading to BOSH Director v280.0.14.
Any other options? Greatly appreciate your suggestions here.
Updating BPM would still be an update of the instance, and as such would have a chance of an improper Postgres shutdown.
@bgandon do you remember if there was a workaround that was used before the fix was implemented?
Hi @bgandon, as @rkoster confirmed, Option 2 will likely run into the same improper Postgres shutdown. Could you please advise on the workaround you used before the BPM fix was implemented, if possible?
We're thinking of using Option 1 as a workaround: manually shutting down Postgres 10 on the BOSH Director VM before attempting the BOSH Director upgrade. Please help confirm whether the following steps will work.
bosh/0:~# for name in "credhub" "uaa" "health_monitor" "director_nginx" "director_sync_dns" "director_scheduler" "blobstore_nginx" "nats" "director"; do monit stop "${name}"; done
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 7d 2h 19m

Process 'nats' not monitored
Process 'postgres' running
Process 'blobstore_nginx' not monitored
Process 'director' not monitored
Process 'worker_1' not monitored
Process 'worker_2' not monitored
Process 'worker_3' not monitored
Process 'worker_4' not monitored
Process 'director_scheduler' not monitored
Process 'director_sync_dns' not monitored
Process 'director_nginx' not monitored
Process 'health_monitor' not monitored
Process 'uaa' not monitored
Process 'credhub' not monitored
System 'system_be0914a6-1473-47f1-58d9-4f3aacbe2ab5' running
3. Unmonitor the Postgres process, so monit won't restart it when Postgres is later shut down directly with the "kill" command.
bosh/0:~# monit unmonitor postgres
bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 7d 2h 54m

Process 'nats' not monitored
Process 'postgres' not monitored
Process 'blobstore_nginx' not monitored
Process 'director' not monitored
Process 'worker_1' not monitored
Process 'worker_2' not monitored
Process 'worker_3' not monitored
Process 'worker_4' not monitored
Process 'director_scheduler' not monitored
Process 'director_sync_dns' not monitored
Process 'director_nginx' not monitored
Process 'health_monitor' not monitored
Process 'uaa' not monitored
Process 'credhub' not monitored
System 'system_be0914a6-1473-47f1-58d9-4f3aacbe2ab5' running
4. Shut down Postgres using the "kill" command with the SIGINT signal, which requests a fast-mode shutdown.
bosh/0:~# postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10) && kill -s SIGINT "${postgres_pid}"
5. Check the Postgres database cluster state and ensure it has shut down properly, reporting "shut down" instead of "in production".
bosh/0:~# su - vcap -c "/var/vcap/packages/postgres-10/bin/pg_controldata -D /var/vcap/store/postgres-10" | grep -F "Database cluster state"
Database cluster state: shut down
6. If the Postgres database cluster state is "shut down", exit the BOSH Director VM and proceed with the BOSH Director upgrade as usual.
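For anyone wanting to script the two verification points in the steps above (the "monit summary" check and the pg_controldata check), here is a rough sketch rather than a tested procedure. The function names and the 60-second default timeout are my own assumptions; the paths are the ones used in the commands above. It only defines helpers, so nothing is stopped or killed when the file is sourced.

```shell
#!/bin/sh
# Paths copied from the manual steps above.
PG_CONTROLDATA=/var/vcap/packages/postgres-10/bin/pg_controldata
PG_DATA_DIR=/var/vcap/store/postgres-10

# Reads `monit summary` output on stdin and prints the name of every
# process whose status is anything other than "not monitored".
# (Hypothetical helper name.)
active_processes() {
  awk '{ sub(/[ \t]+$/, "") }
       $1 == "Process" && !/not monitored$/ { print $2 }' | tr -d "'"
}

# Reads `pg_controldata` output on stdin and prints only the value of
# the "Database cluster state" field, e.g. "shut down".
# (Hypothetical helper name.)
parse_cluster_state() {
  sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Polls pg_controldata once per second until the cluster reports
# "shut down" or the timeout (seconds, assumed default 60) expires.
# Returns 0 on a clean shutdown, 1 otherwise.
wait_for_clean_shutdown() {
  timeout_s="${1:-60}"
  elapsed_s=0
  state=""
  while [ "${elapsed_s}" -lt "${timeout_s}" ]; do
    state="$(su - vcap -c "${PG_CONTROLDATA} -D ${PG_DATA_DIR}" | parse_cluster_state)"
    if [ "${state}" = "shut down" ]; then
      return 0
    fi
    sleep 1
    elapsed_s=$((elapsed_s + 1))
  done
  echo "Postgres cluster state is '${state}' after ${timeout_s}s" >&2
  return 1
}
```

A possible use on the Director VM: after step 3, `monit summary | active_processes` should print nothing, and after the SIGINT in step 4, `wait_for_clean_shutdown 120 || exit 1` would stop you from proceeding while the cluster still reports "in production".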
Describe the bug Failed on upgrading BOSH Director from v271.2.0 to v280.0.14
To Reproduce Steps to reproduce the behavior (example): Deploy a bosh director v271.2.0 on vSphere:
Upload stemcell ubuntu-bionic 1.92
Deploy cf-deployment 21.5.0.
Upgrade the current bosh director v271.2.0 to v280.0.14
The pre-start script of the postgres job failed.
Expected behavior BOSH Director should be successfully upgraded from v271.2.0 to v280.0.14
Logs When sshing into the BOSH Director VM, I found this error in /var/vcap/sys/log/postgres/pre-start.stdout.log:
While the BOSH Director migrates the database from Postgres 10 to Postgres 15 during the upgrade, it complains that the source database (Postgres 10?) was not shut down cleanly. I reran the BOSH Director upgrade several times, but it did not help.
Versions (please complete the following information):
Deployment info: We're using "bosh create-env" command with bosh-deployment to create and upgrade BOSH Director environment. BOSH Director creation script:
new bosh-deployment: https://github.com/cloudfoundry/bosh-deployment/tree/15cbd254db78ab49ef957f2d80ffd2901b09d6e5