Failed on upgrading BOSH Director from v271.2.0 to v280.0.14

phong2tran commented 8 months ago

Describe the bug: The upgrade of BOSH Director from v271.2.0 to v280.0.14 fails.

To Reproduce: Deploy BOSH Director v271.2.0 on vSphere:

$ ./create-env.sh sandbox-cfar 271.2.0
Deployment manifest: '/SANDBOX-CFAR/bosh-director/bosh-deployment-271.2.0/bosh.yml'
Deployment state: '/SANDBOX-CFAR/bosh-director/sandbox-cfar-state.json'

Started validating
  Downloading release 'bosh'... Skipped [Found in local cache] (00:00:00)
  Validating release 'bosh'... Finished (00:00:03)
  Downloading release 'bpm'... Finished (00:00:03)
  Validating release 'bpm'... Finished (00:00:02)
  Downloading release 'bosh-vsphere-cpi'... Finished (00:00:00)
  Validating release 'bosh-vsphere-cpi'... Finished (00:00:01)
  Downloading release 'uaa'... Finished (00:00:09)
  Validating release 'uaa'... Finished (00:00:05)
  Downloading release 'credhub'... Finished (00:00:03)
  Validating release 'credhub'... Finished (00:00:02)
  Downloading release 'os-conf'... Finished (00:00:00)
  Validating release 'os-conf'... Finished (00:00:00)
  Downloading release 'backup-and-restore-sdk'... Finished (00:00:05)
  Validating release 'backup-and-restore-sdk'... Finished (00:00:09)
  Validating cpi release... Finished (00:00:00)
  Validating deployment manifest... Finished (00:00:00)
  Downloading stemcell... Finished (00:00:12)
  Validating stemcell... Finished (00:00:05)
Finished validating (00:01:26)

Started installing CPI
  Compiling package 'ruby-2.6.5-r0.29.0/269dc54d5306119b0e4f89be04f6c470b4876f552753815586fd1ab8ebeaa70d'... Finished (00:04:19)
  Compiling package 'vsphere_cpi/5dffb632edb799be8e2c7aeed263409627b201d6143ce427621f40d6dd461993'... Finished (00:01:53)
  Compiling package 'iso9660wrap/b9eee11ca7251f93ef853db345596783012ae26b5d6ec5cb3d29bf295899c973'... Finished (00:00:00)
  Installing packages... Finished (00:00:01)
  Rendering job templates... Finished (00:00:00)
  Installing job 'vsphere_cpi'... Finished (00:00:00)
Finished installing CPI (00:06:15)

Starting registry... Finished (00:00:00)
Uploading stemcell 'bosh-vsphere-esxi-ubuntu-bionic-go_agent/1.92'... Finished (00:01:26)

Started deploying
  Creating VM for instance 'bosh/0' from stemcell 'sc-74133471-3d5c-4444-8ae0-1b749056bf79'... Finished (00:01:16)
  Waiting for the agent on VM 'vm-2e30ee54-968d-4407-b0e2-0a2c448f6695' to be ready... Finished (00:00:10)
  Creating disk... Finished (00:00:28)
  Attaching disk 'disk-36f89546-442f-4600-b482-ed148588a756' to VM 'vm-2e30ee54-968d-4407-b0e2-0a2c448f6695'... Finished (00:01:08)
  Rendering job templates... Finished (00:00:22)
  Compiling package 'golang/7b633f7a140b41ef9427109d0f3032cf81445ead'... Finished (00:00:27)
  Compiling package 'ruby-2.6.5-r0.29.0/269dc54d5306119b0e4f89be04f6c470b4876f552753815586fd1ab8ebeaa70d'... Finished (00:03:18)
  Compiling package 'mysql/788d06685e1ea1d316759eeeb506782ec7f9302f8c21e2ff04cd4703579f0935'... Finished (00:00:46)
  Compiling package 'libpq/ecbfa62322b4124f25372a19d68b83295b4d290503153667ec378e3196c45f69'... Finished (00:00:28)
  Compiling package 'ruby-2.6.5-r0.29.0/269dc54d5306119b0e4f89be04f6c470b4876f552753815586fd1ab8ebeaa70d'... Finished (00:03:15)
  Compiling package 'database-backup-restorer-boost/05f72399bdd8d91643f42ac411ba65befb78ac0334484dbc3ca95c5286ab7680'... Finished (00:00:19)
  Compiling package 'tini/3d7b02f3eeb480b9581bec4a0096dab9ebdfa4bc'... Finished (00:00:02)
  Compiling package 'bpm-runc/3dcaebacd63b8adc75c5f32954f11041885347b1'... Finished (00:01:47)
  Compiling package 'openjdk_1.8.0/225f67373c9ad0a1da464aeb92f06207bd3e8da1'... Finished (00:00:08)
  Compiling package 'golang-1-linux/7fdbb13e913f2f05232da046b27642ceebab32adf2e78ef3582b63ae6d60df96'... Finished (00:00:27)
  Compiling package 'libpcre2/d5cd2e4263fda94bfeec68d2a388b9e6bb17fa15e28e09c99ebe6a4faa3328f5'... Finished (00:00:14)
  Compiling package 'director/f32385256198535b797059dd4990fcb3b65c0c07337990163c24275a7a29b7e1'... Finished (00:01:25)
  Compiling package 'verify_multidigest/64d1958934e10a0eccc05ddf0d7ba0c8215e6f6d4c227cb93998087335378fa8'... Finished (00:00:01)
  Compiling package 'vsphere_cpi/5dffb632edb799be8e2c7aeed263409627b201d6143ce427621f40d6dd461993'... Finished (00:01:18)
  Compiling package 'davcli/58f558960854f58c55e3d506d3906019178dbc189fbbed1616b8b3c7c02142ea'... Finished (00:00:01)
  Compiling package 'gonats/f58980bd4b0436ff65f588627116dfff63f346f4d13175b7ba47380ab89e08a6'... Finished (00:00:01)
  Compiling package 'database-backup-restorer-postgres-9.4/70d321821ff300fbaef47d64fb7f7b5d33ede23c2349cbf1950886c40f25c2e8'... Finished (00:04:36)
  Compiling package 'database-backup-restorer-postgres-10/41f9bdf0c158e18e850a5744250a39b425f385529b234941c9acf1f6631a3424'... Finished (00:05:14)
  Compiling package 'database-backup-restorer-mysql-5.7/81418214987edce3b03159014ac68449689086d696be746e14857f7551f8f3f6'... Finished (00:02:51)
  Compiling package 'nginx/d4cf69d3e81bed005ebba5bc0bc8d2c28252e70ad47ff455479a9838d5f9b0e4'... Finished (00:01:02)
  Compiling package 'database-backup-restorer-postgres-13/0c18508216826e03c23c623d2f1989405831375c9d457e0ac619125c32b15371'... Finished (00:06:01)
  Compiling package 'database-backup-restorer-postgres-11/be5ee4b5015679ea4d92295ea1eb9a58480c3fff155f69cd1a92f800c11a0c91'... Finished (00:05:38)
  Compiling package 'bpm/818bd9ec39fa5e179c5406c1690fb7c6deb0fc4d'... Finished (00:00:11)
  Compiling package 'postgres-9.4/601f3635b43d0e7ba3ae866e3bd69425cdf33f7fb34a7f1bb21cc26818fb598e'... Finished (00:04:31)
  Compiling package 'credhub/33ea568aad1d35e9522c56f792d3d4fc3cd5975d'... Finished (00:00:07)
  Compiling package 's3cli/7e752dee192da026f6a0cdf2653b855cc6efbe6b041564660f8520c39ddd5a78'... Finished (00:00:02)
  Compiling package 'health_monitor/dd842698e83edeae08bdcc6e672429a5cee3b755645d2024d97b6213f1281d44'... Finished (00:00:34)
  Compiling package 'database-backup-restorer/7c0d80a713009aecb8d6533918a2bf45f7ad0319f50ecca1789fc230aa6d5dd9'... Finished (00:00:06)
  Compiling package 'database-backup-restorer-mariadb/af78e79c98c11c29a721b1d7ba554dd7d0bf25e2789fa933b96bbfd67d697465'... Finished (00:02:12)
  Compiling package 'luna-hsm-client-7.4/746f3c30aadc0af7afc2d5cddcc16d8836a8f845'... Finished (00:00:04)
  Compiling package 'postgres-10/708f8446db4ac7bb21bddce9938e217c741a6e6f82f6209f7e6f6a2b5b25eed3'... Finished (00:05:05)
  Compiling package 'bosh-gcscli/52223432539bbd0607db053f542440869688b4404dd65f2ddf33c2d195b1b891'... Finished (00:00:02)
  Compiling package 'uaa/4f77a97610b962f50d0c21067b48bd467db6066855318c766af8bc1cb990e799'... Finished (00:00:35)
  Compiling package 'iso9660wrap/b9eee11ca7251f93ef853db345596783012ae26b5d6ec5cb3d29bf295899c973'... Finished (00:00:01)
  Compiling package 'database-backup-restorer-mysql-5.6/01bf18f19277261bcccac9736d7634b49eb184a93cd6549b78f4e1d75eabe35a'... Finished (00:02:14)
  Compiling package 'database-backup-restorer-postgres-9.6/6a8fcf2d66b67507403df885b84c4b7cc1d66289f2d7efc5914b43dd2305491c'... Finished (00:05:07)
  Updating instance 'bosh/0'... Finished (00:03:08)
  Waiting for instance 'bosh/0' to be running... Finished (00:01:46)
  Running the post-start scripts 'bosh/0'... Finished (00:00:21)
Finished deploying (01:09:07)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)

Succeeded

root@0036416c4de8:/SANDBOX-CFAR/bosh-director# . bosh-env.sh sandbox-cfar 271.2.0
root@0036416c4de8:/SANDBOX-CFAR/bosh-director# bosh env
Using environment '10.9.202.186' as client 'admin'

Name               sandbox-cfar
UUID               a234617f-6e58-462f-ac51-52c722c3834b
Version            271.2.0 (00000000)
Director Stemcell  ubuntu-bionic/1.92
CPI                vsphere_cpi
Features           compiled_package_cache: disabled
                   config_server: enabled
                   local_dns: enabled
                   power_dns: disabled
                   snapshots: disabled
User               admin

Succeeded

Upload stemcell ubuntu-bionic 1.92

Deploy cf-deployment 21.5.0.

Upgrade the current bosh director v271.2.0 to v280.0.14

$ ./create-env.sh sandbox-cfar 280.0.14
Deployment manifest: '/var/vcap/store/deployment-vm/home/ptran/workspace/SANDBOX-CFAR/bosh-director/bosh-deployment-280.0.14/bosh.yml'
Deployment state: '/var/vcap/store/deployment-vm/home/ptran/workspace/SANDBOX-CFAR/bosh-director/sandbox-cfar-state.json'

Started validating
  Downloading release 'bosh'... Finished (00:00:01)
  Validating release 'bosh'... Finished (00:00:01)
  Downloading release 'bpm'... Finished (00:00:00)
  Validating release 'bpm'... Finished (00:00:00)
  Downloading release 'bosh-vsphere-cpi'... Finished (00:00:01)
  Validating release 'bosh-vsphere-cpi'... Finished (00:00:02)
  Downloading release 'uaa'... Finished (00:00:03)
  Validating release 'uaa'... Finished (00:00:02)
  Downloading release 'credhub'... Finished (00:00:01)
  Validating release 'credhub'... Finished (00:00:01)
  Downloading release 'os-conf'... Finished (00:00:00)
  Validating release 'os-conf'... Finished (00:00:00)
  Downloading release 'backup-and-restore-sdk'... Finished (00:00:04)
  Validating release 'backup-and-restore-sdk'... Finished (00:00:03)
  Validating cpi release... Finished (00:00:00)
  Validating deployment manifest... Finished (00:00:00)
  Downloading stemcell... Skipped [Found in local cache] (00:00:00)
  Validating stemcell... Finished (00:00:12)
Finished validating (00:00:39)

Started installing CPI
  Compiling package 'ruby-3.1/8b225e7cc2608305a7b784b5828b2b4b7c7adc3eb14af46e313d64a9e14a3ad6'... Finished (00:03:39)
  Compiling package 'golang-1-darwin/e6383fc2adbcb1dc5ab18d32b737b1729ff3226b774a358504a44bc5d6bd097f'... Finished (00:00:23)
  Compiling package 'golang-1-linux/c2342901fca75f4c7ec3f32e6a757e923089c6c50d8eb3effd2c25eac1009e31'... Finished (00:00:24)
  Compiling package 'vsphere_cpi/54bcc7a48ba47cc7df2b8dd4704bc8dbb46b945b1a91cbc147262803557a6a7a'... Finished (00:00:35)
  Compiling package 'iso9660wrap/b351c796826a0a3a57e13bad036c12a3958c38f9370bbb50540e782582baaf79'... Finished (00:00:31)
  Installing packages... Finished (00:00:07)
  Rendering job templates... Finished (00:00:00)
  Installing job 'vsphere_cpi'... Finished (00:00:00)
Finished installing CPI (00:05:41)

Uploading stemcell 'bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.340'... Skipped [Stemcell already uploaded] (00:00:00)

Started deploying
  Waiting for the agent on VM 'vm-aef0966d-e843-41ff-873d-2acfe6ee88bb'... Finished (00:00:00)
  Draining jobs on instance 'unknown/0'... Finished (00:00:07)
  Stopping jobs on instance 'unknown/0'... Finished (00:00:00)
  Unmounting disk 'disk-36f89546-442f-4600-b482-ed148588a756'... Finished (00:00:01)
  Deleting VM 'vm-aef0966d-e843-41ff-873d-2acfe6ee88bb'... Finished (00:00:22)
  Creating VM for instance 'bosh/0' from stemcell 'sc-74437d41-122f-4224-a3e1-6266ff62e4df'... Finished (00:00:58)
  Waiting for the agent on VM 'vm-57d3af3a-29bf-4b39-944b-3bcb03d5a164' to be ready... Finished (00:00:29)
  Attaching disk 'disk-36f89546-442f-4600-b482-ed148588a756' to VM 'vm-57d3af3a-29bf-4b39-944b-3bcb03d5a164'... Finished (00:00:40)
  Rendering job templates... Finished (00:00:28)
  Compiling package 'golang-1-linux/c2342901fca75f4c7ec3f32e6a757e923089c6c50d8eb3effd2c25eac1009e31'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'golang-1-darwin/e6383fc2adbcb1dc5ab18d32b737b1729ff3226b774a358504a44bc5d6bd097f'... Finished (00:00:36)
  Compiling package 'golang-1-linux/c2342901fca75f4c7ec3f32e6a757e923089c6c50d8eb3effd2c25eac1009e31'... Finished (00:00:35)
  Compiling package 'ruby-3.1/8b225e7cc2608305a7b784b5828b2b4b7c7adc3eb14af46e313d64a9e14a3ad6'... Finished (00:15:25)
  Compiling package 'director-ruby-3.2/84ee2f9d0485530a75822fa03e7fd0c73544aa4c2f6fe24aaebebe1757195efe'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'tini/3d7b02f3eeb480b9581bec4a0096dab9ebdfa4bc'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'bpm-runc/923e2cae4f8f54cd58de0349352bb14f8662cfa5'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'libopenssl1/7f27f8cdc6cd6f6f865bfbe67ab853977e1505d2ca558415df9bf692eb1b0d63'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'openjdk_17.0/a805b67e0bbf99e97ca878960971301e56d951f67ab5ca14be11553b356556e8'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-boost/05f72399bdd8d91643f42ac411ba65befb78ac0334484dbc3ca95c5286ab7680'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'libpcre2/22fb4c5ee63919fa1e4b1e720fe048f8c55d8998858aeb8172ca67cbdcd0e6de'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'mysql/7ec79ca2b57047da0b337c62944439493b60c1bd5a2767444362cfd1c7b2bbd9'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'libpq/b309a72768019e24e2c592f3f25ded2679e98cbb90f774c3a4d6b7745760079f'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'golang-1-linux/c2342901fca75f4c7ec3f32e6a757e923089c6c50d8eb3effd2c25eac1009e31'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'postgres-15/1059ac62d543dc19011001f80f8c0bb99cc3a9ea4f8c14736e480701051ce9f0'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-postgres-15/162c4cca97dcfd5b12d4241bf40ae421cb3c4fbdbf215ce601f3267865501f66'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'luna-hsm-client-7.4/5956cbd4d17c28c2e4c29f3906e3faddc1d7b921708740f1a532a37d5b6fbe29'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'iso9660wrap/b351c796826a0a3a57e13bad036c12a3958c38f9370bbb50540e782582baaf79'... Finished (00:00:29)
  Compiling package 'vsphere_cpi/54bcc7a48ba47cc7df2b8dd4704bc8dbb46b945b1a91cbc147262803557a6a7a'... Finished (00:01:07)
  Compiling package 'database-backup-restorer-mysql-8.0/488fb8d45895a348f88ca2984fa36939687ad6978deebabd8ee70a1514776f17'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'nats/52d36e5308f7aeced172092016c0fd34f9195ff2788d3106fc2d5cf1ac192c1a'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'bpm/a37a126c1b31da99ab252f4668953a38c4748864'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-mysql-5.6/86603abfbb0d59ebf924449e97fecc422af66d7941bf5498a05099b653a8d3eb'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-postgres-13/ea27ff50286f247ab3acdb3c7cc2101c6d7a666a4eec7c669f7e34e3ef1b51e6'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-postgres-11/b9125bf430a1cf1d00ab83c72e4c5be26f6de52c5315b82beda286d31f4e7cc1'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'davcli/ca2605d13c62b479a215162ea17769326d6f7e37d1002c85816534013235b7d4'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'credhub/e3913a55fb5116fdca99c6403a19a94e7e051e4cd255ab972be279f86ef50de9'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'azure-storage-cli/90a54f4a65a0bfa7d1dc7c651467c1d1b19a009ccbb071ec4ccae42ba903c811'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-mysql-5.7/b1576d316b0046ec60cbbc3ef148eed266daca19992d5b228167a7dfb7059c34'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-mariadb/f66c894e04cf0b91155bf3a3c0af46ff3ce6957ea5f2c07112ba3ead4a185513'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'postgres-10/e3f2ed31116e1a0c929ae6fcdde983a9d6c000c25cafde8a784fd126e06400f9'... Skipped [Package already compiled] (00:00:00)
  Compiling package 's3cli/93d30c08e76d18cf878007359b18c1d1c1c0fb92c757d06bb0bb09de60f2c765'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'verify_multidigest/ffa02c5cc46c56c8006a5c081a16e76b4353f99de7ccc1605c01a95ae47f2fbd'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'health_monitor/5a419aae8750e7fe3f368f6695f8c60fc7d80e8a547d542137d6fbf782cee7fa'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'director/31ce6b1831288b9080178caf68f40d7c59d0743b2f736b449aab842d199fbc4c'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'uaa/2210f02ea85373965968f01d0291a1208d4b6e2e85616a95b477a4354cb93674'... Skipped [Package already compiled] (00:00:03)
  Compiling package 'nginx/82a22b536cf378d354f9325dadcbcb2fa70b1ce9e37eb65a8a7a97cd35e8fc45'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer/84b24a5d9b0a1c07b6484bf908700e2d7990b718e4fd2ce5ee4545337109df2f'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'bosh-gcscli/6394d55f449cad79d0f825815777c3f9f06efcae67850796e905e6aab7e9335b'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'postgres-13/a3141b9f3664abe145c6fb452a54b3bbc4b772933083c2c1ef725c0a7c71824f'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'database-backup-restorer-postgres-10/f4a7d1e2aaad5f2aabb6b0dcbcaedb49305f0d62373af72e2ee8f01eaa595be9'... Skipped [Package already compiled] (00:00:00)
  Updating instance 'bosh/0'... Failed (00:04:49)
Failed deploying (00:26:41)

Cleaning up rendered CPI jobs... Finished (00:00:00)

Deploying:
  Running the pre-start script:
    Sending 'get_task' to the agent:
      Agent responded with error: Action Failed get_task: Task 288aece3-c64b-4578-5bf7-c6a7c8058142 result: 1 of 8 pre-start scripts failed. Failed Jobs: postgres. Successful Jobs: blobstore, nats, bpm, director, user_add, credhub, uaa.

Exit code 1

The pre-start script of the postgres job failed.

Expected behavior: BOSH Director should be successfully upgraded from v271.2.0 to v280.0.14.

Logs: After SSHing into the BOSH Director VM, I found this error in /var/vcap/sys/log/postgres/pre-start.stdout.log:

bosh/0:~$ sudo -i
bosh/0:~# monit summary
/var/vcap/bosh/etc/monitrc:8: Warning: include files not found '/var/vcap/monit/job/*.monitrc'
The Monit daemon 5.2.5 uptime: 20m 

System 'system_8c7a4cee-d163-4cd5-4d8c-cd2c5d15cd6f' running

bosh/0:~# ls /var/vcap/sys/log/postgres/ -hal
total 12K
drwxrwx---  2 root vcap 4.0K Jan 27 05:44 .
drwxr-x--- 16 root vcap 4.0K Jan 27 05:44 ..
-rw-r-----  1 root root    0 Jan 27 05:44 pre-start.stderr.log
-rw-r-----  1 root root  283 Jan 27 05:44 pre-start.stdout.log

bosh/0:~# cat /var/vcap/sys/log/postgres/pre-start.stderr.log 

bosh/0:~# cat /var/vcap/sys/log/postgres/pre-start.stdout.log 
kernel.shmmax = 67108864
copying contents of postgres-10 to postgres-15 for postgres upgrade...
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The source cluster was not shut down cleanly.
Failure, exiting

While the BOSH Director upgrade migrates the database from Postgres 10 to Postgres 15, it complains that the source database (Postgres 10?) was not shut down cleanly. I reran the BOSH Director upgrade several times, but it did not help.
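
One way to confirm this failure mode after a failed run is to search the pre-start log for the message shown above. This is a minimal sketch, assuming the log path from this report; the function name log_reports_unclean_shutdown is hypothetical and not part of BOSH.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if the given log file contains the
# failure message quoted in this report, non-zero otherwise.
log_reports_unclean_shutdown() {
  grep -qF "The source cluster was not shut down cleanly." "$1"
}

# Example use on the Director VM (path taken from this report):
#   log_reports_unclean_shutdown /var/vcap/sys/log/postgres/pre-start.stdout.log \
#     && echo "Postgres 10 was not shut down cleanly before the upgrade"
```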

Versions (please complete the following information):

Infrastructure: vSphere
BOSH versions: from 271.2.0 to 280.0.14
BOSH CLI version: 6.1.1-a0c78bc2-2019-10-25T22:16:25Z (from "bosh -v")
Stemcell versions: ubuntu-bionic/1.92 for the current BOSH Director v271.2.0; ubuntu-jammy/1.340 for the new BOSH Director v280.0.14

... other versions of releases being used (BOSH DNS, Credhub, UAA, BPM, etc)

$ yq '.releases' releases-280.0.14/interpolated-bosh-director-280.0.14.yml
- name: bosh
  sha1: f7fd9b040ab56b9c88dd6c4dfc23fdf682c7d4ad
  url: https://s3.amazonaws.com/bosh-compiled-release-tarballs/bosh-280.0.14-ubuntu-jammy-1.340-20240111-153544-517049233-20240111153545.tgz
  version: 280.0.14
- name: bpm
  sha1: 6ac7f9a016075ed69b6808dfb544146a73565a9f
  url: https://s3.amazonaws.com/bosh-compiled-release-tarballs/bpm-1.2.13-ubuntu-jammy-1.340-20240110-224040-652943252-20240110224041.tgz
  version: 1.2.13
- name: bosh-vsphere-cpi
  sha1: ddcf851983f672b1186590244d94f7dffb959ff2
  url: https://bosh.io/d/github.com/cloudfoundry/bosh-vsphere-cpi-release?v=97.0.5
  version: 97.0.5
- name: uaa
  sha1: a8d7847cf4b5829bcfc085565dfb78697fbc3bb5
  url: https://s3.amazonaws.com/bosh-compiled-release-tarballs/uaa-76.31.0-ubuntu-jammy-1.340-20240119-145417-377757494-20240119145421.tgz
  version: 76.31.0
- name: credhub
  sha1: e9229b2bb5681f9ef8911e653e9719de628b3904
  url: https://s3.amazonaws.com/bosh-compiled-release-tarballs/credhub-2.12.58-ubuntu-jammy-1.340-20240111-190030-621523752-20240111190032.tgz
  version: 2.12.58
- name: os-conf
  sha1: daf34e35f1ac678ba05db3496c4226064b99b3e4
  url: https://bosh.io/d/github.com/cloudfoundry/os-conf-release?v=22.2.1
  version: 22.2.1
- name: backup-and-restore-sdk
  sha1: 28ea9cbf00d89d4d4c363f4459d79268e44ac65f
  url: https://s3.amazonaws.com/bosh-compiled-release-tarballs/backup-and-restore-sdk-1.18.116-ubuntu-jammy-1.340-20240115-082356-879977937-20240115082400.tgz
  version: 1.18.116

Deployment info: We're using the "bosh create-env" command with bosh-deployment to create and upgrade the BOSH Director environment. BOSH Director creation script:

#!/usr/bin/env bash

BIN_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd -P)"

if [[ $# -lt 2 ]]
then
  echo "Usage: $0 <env_name> <bosh_director_version>" 1>&2
  echo "Example: $0 sandbox-cfar 280.0.14" 1>&2
  exit 1
fi

env_name=${1}
bosh_director_version=${2}

bosh create-env ${BIN_DIR}/bosh-deployment-${bosh_director_version}/bosh.yml \
    --state=${BIN_DIR}/${env_name}-state.json \
    --vars-store=${BIN_DIR}/${env_name}-creds.yml \
    -l ${BIN_DIR}/${env_name}-vars-${bosh_director_version}.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/vsphere/cpi.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/uaa.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/credhub.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/jumpbox-user.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/bbr.yml \
    -o ${BIN_DIR}/bosh-deployment-${bosh_director_version}/experimental/enable-metrics.yml \
    -o ${BIN_DIR}/ops/configure-uaa-ldap.yml \
    -o ${BIN_DIR}/ops/change-uaa-login-prompt.yml \
    -o ${BIN_DIR}/ops/map-ldap-to-uaa-groups.yml \
    -o ${BIN_DIR}/ops/use-bosh-compiled-releases-from-artifactory-${bosh_director_version}.yml \
    -o ${BIN_DIR}/ops/use-bosh-stemcell-from-artifactory-${bosh_director_version}.yml \
    -o ${BIN_DIR}/ops/vsphere.yml \
    -o ${BIN_DIR}/ops/dns.yml \
    -o ${BIN_DIR}/ops/ntp.yml \
    -o ${BIN_DIR}/ops/passwd.yml \
    -o ${BIN_DIR}/ops/disk-pools.yml \
    -o ${BIN_DIR}/ops/set-credhub-minimum-certificate-duration.yml

new bosh-deployment: https://github.com/cloudfoundry/bosh-deployment/tree/15cbd254db78ab49ef957f2d80ffd2901b09d6e5


rkoster commented 8 months ago

It seems like you are upgrading from an ancient version of Postgres. This issue was fixed here: https://github.com/cloudfoundry/bpm-release/pull/152

phong2tran commented 8 months ago

Thank you so much for the response @rkoster! Indeed, we're operating an "outdated" BOSH environment and have not upgraded as regularly as we should. We have seen this issue intermittently on a few runs of BOSH Director upgrade testing.

How can we move forward with this BOSH Director v280.0.14 upgrade and ensure that this issue won't happen in our existing production BOSH environments?

Option 1: Can we first manually shut down Postgres 10 on the BOSH Director VM before attempting BOSH Director upgrade? If yes, which command sequences should be used to properly shut down Postgres 10 and other BOSH Director related services?

Option 2: First update BPM component to v1.1.14 or higher (https://github.com/cloudfoundry/bpm-release/pull/152#issuecomment-938235720) with the fix on current BOSH Director v271.2.0 before upgrading to BOSH Director v280.0.14.

Any other options? Greatly appreciate your suggestions here.

rkoster commented 8 months ago

Updating BPM would still be an update of the instance, and as such would still have a chance of an improper Postgres shutdown.

@bgandon do you remember if there was a workaround that was used before the fix was implemented?

phong2tran commented 7 months ago

Hi @bgandon, as @rkoster confirmed, Option 2 will likely run into the same improper Postgres shutdown. Could you please advise on the workaround you used before the BPM fix was implemented, if possible?

We're thinking of using Option 1 as a workaround: manually shutting down Postgres 10 on the BOSH Director VM before attempting the BOSH Director upgrade. Please help confirm whether the following steps will work.

1. SSH into the BOSH Director VM.

2. Use monit to stop all processes except Postgres.

bosh/0:~# for name in "credhub" "uaa" "health_monitor" "director_nginx" "director_sync_dns" "director_scheduler" "blobstore_nginx" "nats" "director"; do monit stop "${name}"; done


bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 7d 2h 19m

Process 'nats' not monitored
Process 'postgres' running
Process 'blobstore_nginx' not monitored
Process 'director' not monitored
Process 'worker_1' not monitored
Process 'worker_2' not monitored
Process 'worker_3' not monitored
Process 'worker_4' not monitored
Process 'director_scheduler' not monitored
Process 'director_sync_dns' not monitored
Process 'director_nginx' not monitored
Process 'health_monitor' not monitored
Process 'uaa' not monitored
Process 'credhub' not monitored
System 'system_be0914a6-1473-47f1-58d9-4f3aacbe2ab5' running

3. Unmonitor the Postgres process so that monit won't restart it when Postgres is later shut down directly with the "kill" command.

bosh/0:~# monit unmonitor postgres

bosh/0:~# monit summary
The Monit daemon 5.2.5 uptime: 7d 2h 54m

Process 'nats' not monitored
Process 'postgres' not monitored
Process 'blobstore_nginx' not monitored
Process 'director' not monitored
Process 'worker_1' not monitored
Process 'worker_2' not monitored
Process 'worker_3' not monitored
Process 'worker_4' not monitored
Process 'director_scheduler' not monitored
Process 'director_sync_dns' not monitored
Process 'director_nginx' not monitored
Process 'health_monitor' not monitored
Process 'uaa' not monitored
Process 'credhub' not monitored
System 'system_be0914a6-1473-47f1-58d9-4f3aacbe2ab5' running

4. Shut down Postgres using the "kill" command with the SIGINT signal, which requests a fast-mode shutdown.

bosh/0:~# postgres_pid=$(/var/vcap/packages/bpm/bin/bpm pid postgres-10) && kill -s SIGINT "${postgres_pid}"

5. Check the Postgres database cluster state and make sure it reports "shut down" rather than "in production", which confirms that the shutdown completed cleanly.

bosh/0:~# su - vcap -c "/var/vcap/packages/postgres-10/bin/pg_controldata -D /var/vcap/store/postgres-10" | grep -F "Database cluster state"
Database cluster state: shut down
6. If the Postgres database cluster state is "shut down", exit the BOSH Director VM and proceed with the BOSH Director upgrade as usual.
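
Since a fast shutdown may take a moment to complete, the check in steps 5 and 6 could be wrapped in a small polling helper. The following is only a sketch under the assumptions of the steps above (same pg_controldata binary and data directory); the function names parse_cluster_state and wait_for_clean_shutdown are hypothetical, and this has not been confirmed as a supported workaround.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read pg_controldata output on stdin and print only
# the value of the "Database cluster state" field.
parse_cluster_state() {
  sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Hypothetical helper: run the given command (expected to print
# pg_controldata output) once per second until the cluster state is
# "shut down" or the timeout in seconds has passed.
#   $1   = timeout in seconds
#   $2.. = command and arguments to run
wait_for_clean_shutdown() {
  local timeout="$1"
  shift
  local deadline=$(( SECONDS + timeout ))
  local state=""
  while :; do
    state="$("$@" 2>/dev/null | parse_cluster_state)"
    if [[ "${state}" == "shut down" ]]; then
      return 0
    fi
    if (( SECONDS >= deadline )); then
      echo "cluster state is still '${state}'" >&2
      return 1
    fi
    sleep 1
  done
}

# Example use on the Director VM, after step 4 (paths from the steps above):
#   wait_for_clean_shutdown 60 su - vcap -c \
#     "/var/vcap/packages/postgres-10/bin/pg_controldata -D /var/vcap/store/postgres-10" \
#     && echo "safe to proceed with the upgrade"
```

A non-zero exit status means the cluster did not reach the "shut down" state within the timeout, in which case the upgrade should not be started.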

cloudfoundry / bosh

Failed on upgrading BOSH Director from v271.2.0 to v280.0.14 #2490