cloudfoundry / postgres-release

BOSH release for PostgreSQL
Apache License 2.0

DB upgrade leads to bosh deploy timeout #25

Closed jsievers closed 6 years ago

jsievers commented 7 years ago

We recently upgraded the postgres release used in Concourse, as advertised in the Concourse 3.5.0 release notes.

I read https://github.com/cloudfoundry/postgres-release/#upgrading and increased databases.monit_timeout to 300 seconds.

According to /var/vcap/sys/log/postgres/postgres_ctl.log, this allowed the DB upgrade to finish without a monit timeout (it took about 2 minutes), but bosh deploy still failed with:

"time":1508245109,"stage":"Updating instance","tags":["db"],"total":1,"task":"db/c58db631-411d-4390-9787-734be1d88eca[98/6448]
ary)","index":1,"state":"failed","progress":100,"data":{"error":"''db/c58db631-411d-4390-9787-734be1d88eca (0)'' is not running after update. Review logs for failed jobs: postgres"}}
{"time":1508245109,"error":{"code":400007,"message":"''db/c58db631-411d-4390-9787-734be1d88eca (0)'' is not running after update. Review logs for failed jobs: postgres"}}
', "result_output" = '', "context_id" = '' WHERE ("id" = 11605)
D, [2017-10-17 12:58:29 #10554] [task:11605] DEBUG -- DirectorJobRunner: (0.001595s) COMMIT
I, [2017-10-17 12:58:29 #10554] []  INFO -- DirectorJobRunner: Task took 2 minutes 53.07573839 seconds to process.

According to the BOSH lifecycle docs, there is another timeout (probably update_watch_time), and that is the one being exceeded.

Rather than increasing update_watch_time for all jobs, it seems a pre_start script would be a better place for long-running tasks like a DB upgrade, because according to the same docs that lifecycle stage does not time out at the BOSH level.
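
To illustrate the suggestion, a BOSH job wires in a pre-start script by rendering a template to bin/pre-start in its job spec. The fragment below is only a sketch of that wiring, not necessarily how this release implements or would implement it; the template file name is hypothetical.

```yaml
# Sketch of a job spec (jobs/postgres/spec); the template name is hypothetical.
name: postgres

templates:
  # BOSH runs bin/pre-start, if present, before the job's processes are
  # started, so a long-running upgrade here is not subject to the monit
  # start timeout or the update watch times.
  pre-start.sh.erb: bin/pre-start
  postgres_ctl.sh.erb: bin/postgres_ctl
```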

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/152037934

The labels on this github issue will be updated when the story is started.

jsievers commented 7 years ago

Our current workaround is to increase databases.monit_timeout on the postgres job, as well as canary_watch_time and update_watch_time in the update block of the Concourse manifest, to 20 minutes each (we have a ~10GB postgres DB).
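
For anyone applying the same workaround, it might look roughly like the fragment below. This is a sketch under assumptions: the release name, the canaries and max_in_flight values, and the use of a range rather than a single integer for the watch times are illustrative and not taken from our actual manifest. The watch times are given in milliseconds, while databases.monit_timeout is given in seconds, so 20 minutes is 1200000 and 1200 respectively.

```yaml
# Sketch of the workaround; values other than the 20-minute limits are assumed.
update:
  canaries: 1                        # assumed
  max_in_flight: 1                   # assumed
  # Milliseconds. With a range, the director starts checking after the
  # lower bound and gives up at the upper bound (20 minutes).
  canary_watch_time: 1000-1200000
  update_watch_time: 1000-1200000

instance_groups:
- name: db
  jobs:
  - name: postgres
    release: postgres                # assumed release name
    properties:
      databases:
        monit_timeout: 1200          # seconds (20 minutes)
```

Note that the update block applies to every instance group unless it is overridden per group, which is why raising the watch times this way affects all jobs in the deployment.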

valeriap commented 6 years ago

Fixed in v23