cloudfoundry / postgres-release

BOSH release for PostgreSQL
Apache License 2.0
14 stars 36 forks source link

postgres fails to restart after VM reboot #33

Closed jsievers closed 6 years ago

jsievers commented 6 years ago

Steps to reproduce

Observed behaviour

from postgres_ctl.err.log :

[2018-01-05 12:53:31+0000] + main start
[2018-01-05 12:53:31+0000] + local action
[2018-01-05 12:53:31+0000] + action=start
[2018-01-05 12:53:31+0000] + source /var/vcap/jobs/postgres/bin/pgconfig.sh
[2018-01-05 12:53:31+0000] ++ current_version=9.6.4
[2018-01-05 12:53:31+0000] ++ pgversion_current=postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ pgversion_old=postgres-9.6.3
[2018-01-05 12:53:31+0000] ++ pgversion_older=postgres-9.6.2
[2018-01-05 12:53:31+0000] ++ JOB_DIR=/var/vcap/jobs/postgres
[2018-01-05 12:53:31+0000] ++ PACKAGE_DIR=/var/vcap/packages/postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ STORE_DIR=/var/vcap/store
[2018-01-05 12:53:31+0000] ++ PG_STORE_DIR=/var/vcap/store/postgres
[2018-01-05 12:53:31+0000] ++ DATA_DIR=/var/vcap/store/postgres/postgres-9.6.4
[2018-01-05 12:53:31+0000] ++ DATA_DIR_PREVIOUS=/var/vcap/store/postgres/postgres-previous
[2018-01-05 12:53:31+0000] ++ DATA_DIR_OLD=/var/vcap/store/postgres/postgres-unknown
[2018-01-05 12:53:31+0000] ++ PACKAGE_DIR_OLD=/var/vcap/packages/postgres-unknown
[2018-01-05 12:53:31+0000] ++ POSTGRES_UPGRADE_LOCK=/var/vcap/store/postgres/POSTGRES_UPGRADE_LOCK
[2018-01-05 12:53:31+0000] ++ pgversion_upgrade_from=postgres-unknown
[2018-01-05 12:53:31+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.3 -a -f /var/vcap/store/postgres/postgres-9.6.3/postgresql.conf ']'
[2018-01-05 12:53:31+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.2 -a -f /var/vcap/store/postgres/postgres-9.6.2/postgresql.conf ']'
[2018-01-05 12:53:31+0000] ++ RUN_DIR=/var/vcap/sys/run/postgres
[2018-01-05 12:53:31+0000] ++ LOG_DIR=/var/vcap/sys/log/postgres
[2018-01-05 12:53:31+0000] ++ PIDFILE=/var/vcap/sys/run/postgres/postgres.pid
[2018-01-05 12:53:31+0000] ++ CONTROL_JOB_PIDFILE=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:53:31+0000] ++ HOST=0.0.0.0
[2018-01-05 12:53:31+0000] ++ PORT=5524
[2018-01-05 12:53:31+0000] ++ [[ -n '' ]]
[2018-01-05 12:53:31+0000] ++ LD_LIBRARY_PATH=/var/vcap/packages/postgres-9.6.4/lib
[2018-01-05 12:53:31+0000] + set +u
[2018-01-05 12:53:31+0000] + source /var/vcap/packages/postgres-common/utils.sh
[2018-01-05 12:53:31+0000] + set -u
[2018-01-05 12:53:31+0000] + case "${action}" in
[2018-01-05 12:53:31+0000] + pid_guard /var/vcap/sys/run/postgres/postgresctl.pid 'Postgres control job' false
[2018-01-05 12:53:31+0000] + local pidfile=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:53:31+0000] + local 'name=Postgres control job'
[2018-01-05 12:53:31+0000] + local logall=false
[2018-01-05 12:53:31+0000] + '[' false == true ']'
[2018-01-05 12:53:31+0000] + '[' -f /var/vcap/sys/run/postgres/postgresctl.pid ']'
[2018-01-05 12:53:31+0000] + echo 1315
[2018-01-05 12:53:31+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory
[2018-01-05 12:55:11+0000] + main start
[2018-01-05 12:55:11+0000] + local action
[2018-01-05 12:55:11+0000] + action=start
[2018-01-05 12:55:11+0000] + source /var/vcap/jobs/postgres/bin/pgconfig.sh
[2018-01-05 12:55:11+0000] ++ current_version=9.6.4
[2018-01-05 12:55:11+0000] ++ pgversion_current=postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ pgversion_old=postgres-9.6.3
[2018-01-05 12:55:11+0000] ++ pgversion_older=postgres-9.6.2
[2018-01-05 12:55:11+0000] ++ JOB_DIR=/var/vcap/jobs/postgres
[2018-01-05 12:55:11+0000] ++ PACKAGE_DIR=/var/vcap/packages/postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ STORE_DIR=/var/vcap/store
[2018-01-05 12:55:11+0000] ++ PG_STORE_DIR=/var/vcap/store/postgres
[2018-01-05 12:55:11+0000] ++ DATA_DIR=/var/vcap/store/postgres/postgres-9.6.4
[2018-01-05 12:55:11+0000] ++ DATA_DIR_PREVIOUS=/var/vcap/store/postgres/postgres-previous
[2018-01-05 12:55:11+0000] ++ DATA_DIR_OLD=/var/vcap/store/postgres/postgres-unknown
[2018-01-05 12:55:11+0000] ++ PACKAGE_DIR_OLD=/var/vcap/packages/postgres-unknown
[2018-01-05 12:55:11+0000] ++ POSTGRES_UPGRADE_LOCK=/var/vcap/store/postgres/POSTGRES_UPGRADE_LOCK
[2018-01-05 12:55:11+0000] ++ pgversion_upgrade_from=postgres-unknown
[2018-01-05 12:55:11+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.3 -a -f /var/vcap/store/postgres/postgres-9.6.3/postgresql.conf ']'
[2018-01-05 12:55:11+0000] ++ '[' -d /var/vcap/store/postgres/postgres-9.6.2 -a -f /var/vcap/store/postgres/postgres-9.6.2/postgresql.conf ']'
[2018-01-05 12:55:11+0000] ++ RUN_DIR=/var/vcap/sys/run/postgres
[2018-01-05 12:55:11+0000] ++ LOG_DIR=/var/vcap/sys/log/postgres
[2018-01-05 12:55:11+0000] ++ PIDFILE=/var/vcap/sys/run/postgres/postgres.pid
[2018-01-05 12:55:11+0000] ++ CONTROL_JOB_PIDFILE=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:55:11+0000] ++ HOST=0.0.0.0
[2018-01-05 12:55:11+0000] ++ PORT=5524
[2018-01-05 12:55:11+0000] ++ [[ -n '' ]]
[2018-01-05 12:55:11+0000] ++ LD_LIBRARY_PATH=/var/vcap/packages/postgres-9.6.4/lib
[2018-01-05 12:55:11+0000] + set +u
[2018-01-05 12:55:11+0000] + source /var/vcap/packages/postgres-common/utils.sh
[2018-01-05 12:55:11+0000] + set -u
[2018-01-05 12:55:11+0000] + case "${action}" in
[2018-01-05 12:55:11+0000] + pid_guard /var/vcap/sys/run/postgres/postgresctl.pid 'Postgres control job' false
[2018-01-05 12:55:11+0000] + local pidfile=/var/vcap/sys/run/postgres/postgresctl.pid
[2018-01-05 12:55:11+0000] + local 'name=Postgres control job'
[2018-01-05 12:55:11+0000] + local logall=false
[2018-01-05 12:55:11+0000] + '[' false == true ']'
[2018-01-05 12:55:11+0000] + '[' -f /var/vcap/sys/run/postgres/postgresctl.pid ']'
[2018-01-05 12:55:11+0000] + echo 1473
[2018-01-05 12:55:11+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory

the important line from log above is

[2018-01-05 12:53:31+0000] /var/vcap/jobs/postgres/bin/postgres_ctl: line 18: /var/vcap/sys/run/postgres/postgresctl.pid: No such file or directory

Expected behaviour

postgres is able to survive a VM reboot.

Workaround

bosh ssh into the postgres VM and manually execute

mkdir -p /var/vcap/sys/run/postgres/
chown -R vcap:vcap /var/vcap/sys/run/
monit start postgres

Possible fix

there are similar bugs already fixed in other bosh releases, see e.g. https://github.com/cloudfoundry/capi-release/issues/25

I think the problem is that the control scripts rely on prestart to create $RUN_DIR:

https://github.com/cloudfoundry/postgres-release/blob/6e31bafab3f03fe0e0adf286f044168f0b3f3d95/jobs/postgres/templates/pre-start.sh.erb#L49-L50

and in case of external VM reboot (i.e. outside of any bosh operation), prestart is not executed. /var/vcap/data/sys/run/ is a temp file system

$ mount | grep sys/run
tmpfs on /var/vcap/data/sys/run type tmpfs (rw,size=1m)

and thus is empty after VM reboot (which is fine).

although there is an mkdir -p $RUN_DIR in https://github.com/cloudfoundry/postgres-release/blob/6e31bafab3f03fe0e0adf286f044168f0b3f3d95/jobs/postgres/templates/postgres_ctl.sh.erb#L42

, this is too late in case of VM reboot because of this line executed earlier:

https://github.com/cloudfoundry/postgres-release/blob/6e31bafab3f03fe0e0adf286f044168f0b3f3d95/jobs/postgres/templates/postgres_ctl.sh.erb#L18

which assumes the directory $RUN_DIR already exists.

Initialization of $RUN_DIR should probably happen as the very first thing when start) is invoked, see the fix for a similar problem in cloud_controller_ng

In our scenario (we run CCDB on postgres-release) this bug has caused an extended Cloudfoundry downtime triggered by forced VM reboots which were required to deploy urgent security fixes on the IaaS level.

cf-gitbot commented 6 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/154085953

The labels on this github issue will be updated when the story is started.

valeriap commented 6 years ago

@jsievers Thanks for reporting this issue. The fix in included in v24.