cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Resizing persistent disk and vm_type fails in one single deploy execution #2433

Closed: Malsourie closed this issue 1 year ago

Malsourie commented 1 year ago

Describe the bug: We are trying to resize the vm_type and persistent disk type of a single instance in one deploy, but get the following error: Error: Action Failed get_task: Task 9e2690af-bb70-4516-7b1c-fe55ea420740 result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/sdc1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy.

To Reproduce: Steps to reproduce the behavior:

  1. Deploy a bosh director
  2. Upload stemcell and cloud-config
  3. Deploy manifest
  4. Change the vm_type and disk_type in the manifest
  5. Deploy modified manifest

Expected behavior: The VM and the attached disk are resized.

Logs

Using deployment 'debug'

  instance_groups:
  - name: debug
-   persistent_disk_type: service_fabrik_hdd_1gb
+   persistent_disk_type: service_fabrik_hdd_2gb
-   vm_type: service_fabrik_vm_micro
+   vm_type: service_fabrik_vm_small

Continue? [yN]: y

Task 1011481

Task 1011481 | 10:47:02 | Preparing deployment: Preparing deployment (00:00:00)
Task 1011481 | 10:47:02 | Preparing deployment: Rendering templates (00:00:00)
Task 1011481 | 10:47:02 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 1011481 | 10:47:03 | Updating instance debug: debug/620ce8c5-c534-4778-ab05-e04a0f490df6 (0) (canary)
Task 1011481 | 10:47:03 | L executing pre-stop: debug/620ce8c5-c534-4778-ab05-e04a0f490df6 (0) (canary)
Task 1011481 | 10:47:03 | L executing drain: debug/620ce8c5-c534-4778-ab05-e04a0f490df6 (0) (canary)
Task 1011481 | 10:47:04 | L stopping jobs: debug/620ce8c5-c534-4778-ab05-e04a0f490df6 (0) (canary)
Task 1011481 | 10:47:05 | L executing post-stop: debug/620ce8c5-c534-4778-ab05-e04a0f490df6 (0) (canary) (00:05:14)
                        L Error: Action Failed get_task: Task 9e2690af-bb70-4516-7b1c-fe55ea420740 result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/sdc1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy.
': exit status 32
Task 1011481 | 10:52:17 | Error: Action Failed get_task: Task 9e2690af-bb70-4516-7b1c-fe55ea420740 result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/sdc1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy.
': exit status 32

Task 1011481 Started  Thu Mar  9 10:47:02 UTC 2023
Task 1011481 Finished Thu Mar  9 10:52:17 UTC 2023
Task 1011481 Duration 00:05:15
Task 1011481 error

Updating deployment:
  Expected task '1011481' to succeed but state is 'error'

Exit code 1

Versions (please complete the following information):

Deployment info: manifest:

name: debug
instance_groups:
- azs: [z1]
  instances: 1
  jobs: []
  name: debug
  persistent_disk_type: <disk_type>
  networks:
  - name: sf_compilation
  stemcell: default
  vm_type: <vm_type>
  env:
    bosh:
      # c1oudc0w
      password: "$6$3RO2Vvl4EXS2TMRD$IaNjbMHYCSBiQLQr0PKK8AdfDHTsNunqh3kO7USouNS/tWAvH0JmtDfrhLlHwN0XUCUrBVpQ02hoHYgTdaaeY1"
      authorized_keys: [((ssh.public_key))]
      remove_static_libraries: true
releases: []
variables:
- name: ssh
  type: ssh
stemcells:
- alias: default
  os: ubuntu-jammy
  version: latest
update:
  canaries: 2
  canary_watch_time: 5000-60000
  max_in_flight: 1
  update_watch_time: 5000-60000

Malsourie commented 1 year ago

We have tested the same resizing with BOSH version 277.1 and everything works fine. It seems to be an issue in the 277.2 release.

Logs:

Using deployment 'debug'

  instance_groups:
  - name: debug
-   persistent_disk_type: service_fabrik_hdd_1gb
+   persistent_disk_type: service_fabrik_hdd_2gb
-   vm_type: service_fabrik_vm_micro
+   vm_type: service_fabrik_vm_small

Continue? [yN]: y

Task 16

Task 16 | 12:41:32 | Preparing deployment: Preparing deployment (00:00:01)
Task 16 | 12:41:33 | Preparing deployment: Rendering templates (00:00:00)
Task 16 | 12:41:33 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 16 | 12:41:33 | Updating instance debug: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:33 | L executing pre-stop: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:33 | L executing drain: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:34 | L stopping jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:35 | L executing post-stop: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:15 | L installing packages: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:16 | L configuring jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:16 | L executing pre-start: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:49 | L starting jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:54 | L executing post-start: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary) (00:03:22)

Task 16 Started  Thu Mar  9 12:41:32 UTC 2023
Task 16 Finished Thu Mar  9 12:44:55 UTC 2023
Task 16 Duration 00:03:23
Task 16 done

Succeeded

Malsourie commented 1 year ago

We have done several more tests and have the following findings:

  1. We tested scaling up only the disk_type and only the vm_type; both cases are successful. But we found that once the vm_type has been scaled up, it is no longer possible to scale up the disk_type. We ssh'd to the scaled-up VM and executed umount /var/vcap/store && mount <dev_path> /var/vcap/store -o ro and got the same error: mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy. So in general, on all recreated VMs the disk can no longer be remounted as read-only.
  2. We created dev releases of bosh commit by commit and identified the commit that causes the issue: https://github.com/cloudfoundry/bosh/commit/dec31de320fcd29a574db8685f6abf697138f788
  3. We used the parameter introduced by this commit, director.enable_short_lived_nats_bootstrap_credentials = false, to deactivate the feature; then everything works fine again.
  4. We checked the settings.json file in /var/vcap/bosh: if we activate the new feature, the mbus certificates are rotated after the VM with the new vm_type is created during a scale-up. If we deactivate the new feature, the mbus certificate is not changed on the new VM.
  5. We checked the nats.log file in /var/vcap/sys/log and see warnings such as: [7] 2023/03/15 14:52:30.766170 [WRN] User "CN = <agent_id>.bootstrap.agent.bosh-internal" found in connect proto, but user required from cert. Does that matter?
  6. We also found the line of code that causes the issue: https://github.com/cloudfoundry/bosh/blob/8c81d7322296ce2baf69a8c43d02af82237f9611/src/bosh-director/lib/bosh/director/deployment_plan/instance.rb#L204 We removed || force_nats_rotation (introduced in the commit) from that line and created a dev release; then everything works fine again.
  7. When I read through the commit changes, I am also confused about this line: https://github.com/cloudfoundry/bosh/blob/dec31de320fcd29a574db8685f6abf697138f788/src/bosh-director/lib/bosh/director/deployment_plan/steps/update_instance_settings_step.rb#L15 From my understanding, if we activate the feature enable_short_lived_nats_bootstrap_credentials, shouldn't the column permanent_nats_credentials be false, according to the feature and column names? But in fact they always have the same boolean value. In the code, the certificate {agent_id}.bootstrap.agent.bosh-internal is created when a VM is created: https://github.com/cloudfoundry/bosh/blob/dec31de320fcd29a574db8685f6abf697138f788/src/bosh-director/lib/bosh/director/deployment_plan/steps/create_vm_step.rb#L121. But if we activate the new feature by default, then permanent_nats_credentials will be true and the bootstrap.agent.bosh-internal user will not be added to the NATS auth.json file: https://github.com/cloudfoundry/bosh/blob/dec31de320fcd29a574db8685f6abf697138f788/src/bosh-nats-sync/lib/nats_sync/nats_auth_config.rb#L51-L53 Then the initial cert would not be trusted by NATS. Is that right?

beyhan commented 1 year ago

If we look into these two test cases

They should provide answers to the questions above. We seem to have an issue with the rotation because permanent_nats_credentials has the value of the feature configuration, which is true when the feature is activated, but the table column has an inverted logic. @nouseforaname, can you confirm this?

I'm wondering why this prevents us from migrating the disk. Is there a process which blocks the (un)mount operation?

Malsourie commented 1 year ago

Thank you for your answer, @beyhan!

We have checked the processes via lsof and ps aux, but did not find any process writing to the device or to the /var/vcap/store directory. The mount command has a built-in option, -o remount; I tried it and the disk can then be remounted as read-only. The command I executed: mount <dev> /var/vcap/store -o remount,ro
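
For illustration, a minimal Go sketch (not BOSH agent code; the device and mount point below are placeholders) of this in-place remount, shelling out to mount the same way the agent does according to the error above:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

// remountReadOnly switches an already-mounted filesystem to read-only in place,
// instead of unmounting it and mounting it again with -o ro.
func remountReadOnly(device, mountPoint string) error {
    out, err := exec.Command("mount", device, mountPoint, "-o", "remount,ro").CombinedOutput()
    if err != nil {
        return fmt.Errorf("remounting %s read-only: %w (output: %s)", mountPoint, err, out)
    }
    return nil
}

func main() {
    // Placeholder values; on the affected VM these were /dev/sdc1 and /var/vcap/store.
    if err := remountReadOnly("/dev/sdc1", "/var/vcap/store"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}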

ramonskie commented 1 year ago

I finally replicated the issue (a stupid typo on my part). Set up a BOSH director with bbl, deploy the following debug manifest, and then change the vm_type and persistent_disk_type:

name: debug
instance_groups:
- azs: [z1]
  instances: 1
  jobs: []
  name: debug
  networks:
  - name: default
  stemcell: default
  # vm_type: default
  # persistent_disk_type: 1GB
  vm_type: medium
  persistent_disk_type: 5GB
  env:
    bosh:
      # c1oudc0w
      password: "$6$3RO2Vvl4EXS2TMRD$IaNjbMHYCSBiQLQr0PKK8AdfDHTsNunqh3kO7USouNS/tWAvH0JmtDfrhLlHwN0XUCUrBVpQ02hoHYgTdaaeY1"
      authorized_keys: [((ssh.public_key))]
      remove_static_libraries: true
releases: []
variables:
- name: ssh
  type: ssh
stemcells:
- alias: default
  os: ubuntu-jammy
  version: latest
update:
  canaries: 2
  canary_watch_time: 5000-60000
  max_in_flight: 1
  update_watch_time: 5000-60000

With normal disk migration I get the error:

Error: Action Failed get_task: Task 900c2798-627e-427d-53d5-e8b884ae938e result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/nvme2n1p1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/nvme2n1p1 already mounted or mount point busy.
': exit status 32

With native disk resize enabled I get the following error, which results in a deployment that is now missing a disk, so reverting the disk and vm type results in an error saying the persistent disk is missing:

Error: Unknown CPI error 'Unknown' with message 'You are not authorized to perform this operation. Encoded authorization failure message: OTlCAyfAu1HM69a63Tp2VF4A2RWDzyJiVLcMlfV54yl4c0GNzg1A0CeyofbmfiuU_mBj-duSg674AWpaEt7bta3qn6UIIexr4aa23FsLN-1a1AOm_TsB0-Y64FyPYeYBHoH3e_asaY94UWnYLf_4B5oLNtFSaDGAENyFIMOykPLYoMoXeMuUuDTcDSRTFzBvD0cIVyqCUL6M8lYfKpe1vdrJI_i6wbHbbpgX78UERsMn-S-8DoffFIxduflcjmFGnNuwe5ACzs0y9sBQqir345Cp3KVg2ecS2jZ7YunJzxuyz6us7iQ2mT-ng5StA3e9eH8COzJ2jir5u5v2KOpudGELX0fQy5ALoTZrfaKtKNn_acSJ9q6m_1BlvmkIoaR_cVLE77DsBv1DU-qcMaIgbyIipMBS0Jxi_uUS7ITtACpkUieA191XWu5TKP9Uo0OtV7QEdrJyHth-zlbGIoKEHW1bcZuLB4CyxpX1KByksqwYzossmfo6gTXHflR8UEkbYpjlFx2KXN3cYgJpr2p3LvweuK56YtF3Zay-uUacqe6gTkGl0BPV0cksf2xjEf3EyQa9nZB2K1xBKkfSKF___u6PkJQuTTiO34VOK17a58MoNUcg69l2ETD-HCzvPCAdvmfmwH04W9XzIGTRqni2So7gRmRlDKN7R1kNTYsArTTjiDghQKsG1hI1ixTXgf-6h2UQQINXlYw26jrZsPQ0aah29guJiTloTYRmGKPJ496PVXKBW1N2BWpcI9LqFDUzn7oGEFN9VbmWfkFlAbbwZ2WOBWwLkMJxu52w7txLbmvNyjhUwmX6ndjE152UlH5SXq74vPoSGhK9On2ZOsLSV2BEtfcu7DkVOaYFXDyQ6S-F3HfJPMO_IbSST8oYkVl-DZoGFzKudrPBvvpC8rTY7qmgJLGLnA' in 'resize_disk' CPI method (CPI request ID: 'cpi-202758')

jpalermo commented 1 year ago

Some findings from today. You don't need to recreate the VM at all.

Steps so far:

fuser and lsof seem to show no files using the device or mount point.

Was unable to reproduce the issue on Bionic with the same steps. Starting to feel like some sort of kernel bug introduced in Jammy.

Found this in the kernel log, but didn't find much on it: sdb1: Can't mount, would change RO state

Was unable to reproduce it without restarting the agent. I grabbed all of the 'Running command' output from the agent logs during a restart, then tried running all of those commands myself, but wasn't able to trigger the problem.

It seems to happen as soon as the agent is started up. If I sv start agent, then within a couple of seconds try to umount/mount the device as read-only, I get the error.

However, in one test I did it all in a single shell command, sv start agent; umount /var/vcap/store; mount PERSISTENT_DISK_DEVICE_ID /var/vcap/store -o ro, and this worked. But I was then unable to mount it as read/write, so it almost seems to get stuck in whatever state it is in.

The new feature is triggering this because it updates the NATS certs as part of the deploy, which triggers an agent restart.

ramonskie commented 1 year ago

It seems that if we use remount it works: mount -o remount,ro. This could be an easy fix in the agent by changing this line: https://github.com/cloudfoundry/bosh-agent/blob/main/platform/disk/linux_mounter.go#L78
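
For context, a rough sketch (using golang.org/x/sys/unix; not the agent's actual implementation, which shells out to the mount binary) of what mount -o remount,ro does at the syscall level: the existing mount is flipped to read-only in place, with no unmount involved. The device, mount point, and filesystem type below are placeholders:

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

// remountReadOnly remounts an existing mount read-only via mount(2),
// the equivalent of running: mount <device> <mountPoint> -o remount,ro
func remountReadOnly(device, mountPoint, fsType string) error {
    flags := uintptr(unix.MS_REMOUNT | unix.MS_RDONLY)
    if err := unix.Mount(device, mountPoint, fsType, flags, ""); err != nil {
        return fmt.Errorf("remounting %s read-only: %w", mountPoint, err)
    }
    return nil
}

func main() {
    if err := remountReadOnly("/dev/sdc1", "/var/vcap/store", "ext4"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}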

Also notice the two different errors if we use native disk resize vs. migrating the disk.

danielfor commented 1 year ago

@beyhan, thanks for clarifying the questions for @Malsourie. Regarding this finding:

They should provide answers to the questions above. We seem to have an issue with the rotation because permanent_nats_credentials has the value of the feature configuration, which is true when the feature is activated, but the table column has an inverted logic. @nouseforaname, can you confirm this?

and: From my understanding, if we activate the feature enable_short_lived_nats_bootstrap_credentials, shouldn't the column permanent_nats_credentials be false, according to the feature and column names? But in fact they always have the same boolean value. In the code, the certificate {agent_id}.bootstrap.agent.bosh-internal is created when a VM is created:

The intention of the permanent_nats_credentials column in the VM table is to record, for that particular VM, whether the short-lived NATS credentials have been rotated, i.e. whether that VM is now using the permanent credentials or not. This flag is different from the global flag received in the manifest, enable_short_lived_nats_bootstrap_credentials, which indicates whether new deployments will have the new feature (rotating the short-lived bootstrap credential) enabled or not. Thanks for noticing that this could be ambiguous. As mentioned by @jpalermo, the problem when remounting a disk is not directly caused by the new feature but by the agent restart that it triggers. We are looking at it.
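
To make that distinction concrete, a small hypothetical sketch (written in Go only for consistency with the other sketches in this thread; the real logic lives in the Ruby director) of how the global manifest flag and the per-VM column are intended to relate:

package main

import "fmt"

// Global flag from the director manifest:
// director.enable_short_lived_nats_bootstrap_credentials
type director struct {
    enableShortLivedNATSBootstrapCredentials bool
}

// Per-VM column: true once this VM's short-lived bootstrap NATS credentials
// have been rotated and it is using the permanent credentials.
type vm struct {
    permanentNATSCredentials bool
}

// needsNATSRotation sketches the intended rule: a VM's bootstrap credentials
// still have to be rotated only if the feature is enabled and the VM has not
// yet been switched to permanent credentials.
func (d director) needsNATSRotation(v vm) bool {
    return d.enableShortLivedNATSBootstrapCredentials && !v.permanentNATSCredentials
}

func main() {
    d := director{enableShortLivedNATSBootstrapCredentials: true}
    fmt.Println(d.needsNATSRotation(vm{permanentNATSCredentials: false})) // true: still on bootstrap credentials
    fmt.Println(d.needsNATSRotation(vm{permanentNATSCredentials: true}))  // false: already rotated
}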

jpalermo commented 1 year ago

We tracked the problem down to the systemctl restart chrony.service call in the sync-time script, which was probably at the very bottom of our "things that might be causing this" list.

https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/65fdd0f8cd8468980664f3afea634810a0575959

We don't really understand "why" this causes the behavior we're seeing, so if anybody has any thoughts please let us know.

Assuming the change doesn't cause problems, it should be in the next version of the Jammy stemcell.

ramonskie commented 1 year ago

Should we add an extra integration test to our BOSH acceptance tests for this scenario of upgrading the VM and disk type in one go?

Malsourie commented 1 year ago

@danielfor Thank you for your explanation!

We observed that the issue occurs after we updated Jammy from 1.83 to 1.93. Maybe it is caused by commits in between.

jpalermo commented 1 year ago

@Malsourie, can you test with Jammy 1.95? The fix should be included there.

Malsourie commented 1 year ago

We tested with Jammy 1.95 and the problem does not happen anymore.

max-soe commented 1 year ago

Hi @jpalermo,

We rolled the new stemcell 1.95 out to our landscapes, but during the migration from 1.93 to 1.95 we again faced an issue with mounting a disk.

Task 563876 | 10:49:31 | L executing post-stop: xxxxxxxxxxxxxxx (0) (canary) (00:00:41)
                       L Error: Action Failed get_task: Task 01f2fb88-4a9d-42b9-7fef-9a9bbac751ec result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/nvme2n1p1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/nvme2n1p1 already mounted or mount point busy.
': exit status 32

Do you have an idea how we can migrate to the new stemcell without running into the issue again?

Malsourie commented 1 year ago

Hi, I would go for the mount option -o remount whenever we do remounting. Since the issue is caused by restarting the agent, it could also happen during the upgrade to Jammy 1.95. We observed:

  1. The stemcell is successfully upgraded to 1.93.
  2. Password/certificate rotation is triggered.
  3. While upgrading to 1.95, the disk_type changes, and then we face the same issue again.

With the current fix we can only be sure the issue won't happen anymore if the VM and disk are created directly on Jammy 1.95.

jpalermo commented 1 year ago

We'll see if we can reproduce this in an upgrade situation, though that shouldn't matter, since the disk migration and mounting/remounting should only happen on the newly created VM.

This makes me think we're not fully understanding the problem.

jpalermo commented 1 year ago

We've been unable to reproduce the problem using the following steps:

However, @rkoster pointed me at this PR which was to fix this issue in Bionic. This seems like the exact problem we're seeing here.

This got removed in some of the Chrony cleanup work for Jammy done here.

@ramonskie, not sure if you remember that original work, but it would be good to get your thoughts here.

Malsourie commented 1 year ago

Hi @jpalermo, could you please execute sv stop agent && sv start agent on that VM between step "Deployed a GCP 1.93 VM with a persistent disk" and step "Changed the size of the disk and updated the stemcell to 1.95", in order to simulate that the agent is restarted because of password/certificate rotation?

jpalermo commented 1 year ago

As we are using the new short-lived NATS credentials in our environments, the agent is restarted within the first minute of both VMs being created, which we can see in the agent logs:

grep "Starting agent" /var/vcap/bosh/log/current
2023-03-30_21:29:53.68340 [main] 2023/03/30 21:29:53 DEBUG - Starting agent
2023-03-30_21:30:18.70624 [main] 2023/03/30 21:30:18 DEBUG - Starting agent

I did do an extra agent restart on the 1.93 VM just to be sure, though. So:

Everything seemed to work fine.

ramonskie commented 1 year ago

@jpalermo, I do remember this, yes. It took a long time to find the cause, but it seems @cunnie removed it in this commit: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/248de3dbbdfb1db84cf07f99dcedcc3f6287855d. Lucky for us, he is pretty descriptive in his commit messages :)

rkoster commented 1 year ago

Closing since all changes have been merged.