Status: Closed (Malsourie closed this issue 1 year ago)
We have tested the same resizing with BOSH version 277.1, and everything works fine. It seems to be an issue in the 277.2 release.
Logs:
Using deployment 'debug'
instance_groups:
- name: debug
- persistent_disk_type: service_fabrik_hdd_1gb
+ persistent_disk_type: service_fabrik_hdd_2gb
- vm_type: service_fabrik_vm_micro
+ vm_type: service_fabrik_vm_small
Continue? [yN]: y
Task 16
Task 16 | 12:41:32 | Preparing deployment: Preparing deployment (00:00:01)
Task 16 | 12:41:33 | Preparing deployment: Rendering templates (00:00:00)
Task 16 | 12:41:33 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 16 | 12:41:33 | Updating instance debug: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:33 | L executing pre-stop: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:33 | L executing drain: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:34 | L stopping jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:41:35 | L executing post-stop: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:15 | L installing packages: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:16 | L configuring jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:16 | L executing pre-start: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:49 | L starting jobs: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary)
Task 16 | 12:44:54 | L executing post-start: debug/d09b7f4c-2ab1-4f20-a958-995947966509 (0) (canary) (00:03:22)
Task 16 Started Thu Mar 9 12:41:32 UTC 2023
Task 16 Finished Thu Mar 9 12:44:55 UTC 2023
Task 16 Duration 00:03:23
Task 16 done
Succeeded
We have done several more tests and have the following findings:

- We ran `umount /var/vcap/store && mount <dev_path> /var/vcap/store -o ro` and got the same error: `mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy`. So in general, on all recreated VMs the disk cannot be remounted as read-only anymore.
- We set `director.enable_short_lived_nats_bootstrap_credentials = false` to deactivate the feature; then everything works fine again.
- Looking at the `settings.json` file in `/var/vcap/bosh`: if we activate the new feature, the mbus certificates are rotated after the new vm_type is created during a scale-up. If we deactivate the new feature, the mbus certificate is not changed on the new VM.
- Looking at the `nats.log` file in `/var/vcap/sys/log`, we see warnings: `[7] 2023/03/15 14:52:30.766170 [WRN] User "CN = <agent_id>.bootstrap.agent.bosh-internal" found in connect proto, but user required from cert`
Does that matter?

- We removed `|| force_nats_rotation` from that line, which was introduced in the commit, and created a dev release. Then everything works fine again.

From my understanding, if we activate the feature `enable_short_lived_nats_bootstrap_credentials`, shouldn't the column `permanent_nats_credentials` be `false` according to the feature and column name? But in fact they always have the same boolean value. In the code, the certificate `{agent_id}.bootstrap.agent.bosh-internal` is created when a VM is created: https://github.com/cloudfoundry/bosh/blob/dec31de320fcd29a574db8685f6abf697138f788/src/bosh-director/lib/bosh/director/deployment_plan/steps/create_vm_step.rb#L121. But if we activate the new feature by default, then `permanent_nats_credentials` will be `true` and `bootstrap.agent.bosh-agent` will not be added to the NATS auth.json file: https://github.com/cloudfoundry/bosh/blob/dec31de320fcd29a574db8685f6abf697138f788/src/bosh-nats-sync/lib/nats_sync/nats_auth_config.rb#L51-L53. Then the initial cert would not be trusted by NATS, is that right?
If we look into these two test cases:

- https://github.com/cloudfoundry/bosh/blob/main/src/bosh-director/spec/unit/deployment_plan/steps/update_instance_settings_step_spec.rb#L49-L68
- https://github.com/cloudfoundry/bosh/blob/main/src/bosh-nats-sync/spec/nats_sync/nats_auth_config_spec.rb#L57-L97

They should provide answers to the questions above. We seem to have an issue with the rotation, because `permanent_nats_credentials` takes the value of the feature configuration, which is `true` when the feature is activated, but the table column has inverted logic. @nouseforaname, can you confirm this?

I'm wondering why this prevents us from migrating the disk. Is there a process which blocks the (un)mount operation?
Thank you for your answer @beyhan !
We have checked the processes via the commands `lsof` and `ps aux`, but did not find any process writing to the device or to the `/var/vcap/store` directory. There is a built-in option of the `mount` command, `-o remount`; I tried it, and then the disk can be mounted as read-only. The command I executed: `mount <dev> /var/vcap/store -o remount,ro`
I finally replicated the issue (a stupid typo on my part).

Set up a BOSH with bbl, deploy the following debug manifest, and then change the `vm_type` and `persistent_disk_type`:
name: debug
instance_groups:
- azs: [z1]
  instances: 1
  jobs: []
  name: debug
  networks:
  - name: default
  stemcell: default
  # vm_type: default
  # persistent_disk_type: 1GB
  vm_type: medium
  persistent_disk_type: 5GB
  env:
    bosh:
      # c1oudc0w
      password: "$6$3RO2Vvl4EXS2TMRD$IaNjbMHYCSBiQLQr0PKK8AdfDHTsNunqh3kO7USouNS/tWAvH0JmtDfrhLlHwN0XUCUrBVpQ02hoHYgTdaaeY1"
      authorized_keys: [((ssh.public_key))]
      remove_static_libraries: true
releases: []
variables:
- name: ssh
  type: ssh
stemcells:
- alias: default
  os: ubuntu-jammy
  version: latest
update:
  canaries: 2
  canary_watch_time: 5000-60000
  max_in_flight: 1
  update_watch_time: 5000-60000
With normal disk migration I get the error:
Error: Action Failed get_task: Task 900c2798-627e-427d-53d5-e8b884ae938e result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/nvme2n1p1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/nvme2n1p1 already mounted or mount point busy.
': exit status 32
With native disk resize enabled I get the following error, which results in a deployment that is now missing a disk; reverting the disk and VM type then fails with an error saying the persistent disk is missing.
Error: Unknown CPI error 'Unknown' with message 'You are not authorized to perform this operation. Encoded authorization failure message: OTlCAyfAu1HM69a63Tp2VF4A2RWDzyJiVLcMlfV54yl4c0GNzg1A0CeyofbmfiuU_mBj-duSg674AWpaEt7bta3qn6UIIexr4aa23FsLN-1a1AOm_TsB0-Y64FyPYeYBHoH3e_asaY94UWnYLf_4B5oLNtFSaDGAENyFIMOykPLYoMoXeMuUuDTcDSRTFzBvD0cIVyqCUL6M8lYfKpe1vdrJI_i6wbHbbpgX78UERsMn-S-8DoffFIxduflcjmFGnNuwe5ACzs0y9sBQqir345Cp3KVg2ecS2jZ7YunJzxuyz6us7iQ2mT-ng5StA3e9eH8COzJ2jir5u5v2KOpudGELX0fQy5ALoTZrfaKtKNn_acSJ9q6m_1BlvmkIoaR_cVLE77DsBv1DU-qcMaIgbyIipMBS0Jxi_uUS7ITtACpkUieA191XWu5TKP9Uo0OtV7QEdrJyHth-zlbGIoKEHW1bcZuLB4CyxpX1KByksqwYzossmfo6gTXHflR8UEkbYpjlFx2KXN3cYgJpr2p3LvweuK56YtF3Zay-uUacqe6gTkGl0BPV0cksf2xjEf3EyQa9nZB2K1xBKkfSKF___u6PkJQuTTiO34VOK17a58MoNUcg69l2ETD-HCzvPCAdvmfmwH04W9XzIGTRqni2So7gRmRlDKN7R1kNTYsArTTjiDghQKsG1hI1ixTXgf-6h2UQQINXlYw26jrZsPQ0aah29guJiTloTYRmGKPJ496PVXKBW1N2BWpcI9LqFDUzn7oGEFN9VbmWfkFlAbbwZ2WOBWwLkMJxu52w7txLbmvNyjhUwmX6ndjE152UlH5SXq74vPoSGhK9On2ZOsLSV2BEtfcu7DkVOaYFXDyQ6S-F3HfJPMO_IbSST8oYkVl-DZoGFzKudrPBvvpC8rTY7qmgJLGLnA' in 'resize_disk' CPI method (CPI request ID: 'cpi-202758')
Some findings from today. You don't need to recreate the VM at all.
Steps so far:
- `sv stop agent && sv start agent`
- `umount /var/vcap/store`

`fuser` and `lsof` seem to show no files using the device or mount point.
Was unable to reproduce the issue on Bionic with the same steps. Starting to feel like some sort of kernel bug introduced in jammy.
Found this in the kernel log, but didn't find much on it: sdb1: Can't mount, would change RO state
Was unable to reproduce it without restarting the agent. I grabbed all the `Running command` output from the agent logs during a restart, then tried running all of those commands myself, but wasn't able to trigger the problem.
It seems to happen as soon as the agent is started up. If I sv start agent
, then within a couple of seconds try to umount/mount the device as read-only, I get the error.
In one test, though, I did it all in a single shell command: `sv start agent; umount /var/vcap/store; mount PERSISTENT_DISK_DEVICE_ID /var/vcap/store -o ro` and this worked. However, I was then unable to mount it read/write again. So it almost seems to get stuck in whatever state it is in.
The new feature is triggering this because it updates the NATS certs as part of the deploy, which triggers an agent restart.
It seems that if we use remount, it works: `mount -o remount,ro`

This could be an easy fix in the agent, by changing this line:
https://github.com/cloudfoundry/bosh-agent/blob/main/platform/disk/linux_mounter.go#L78

Also notice the two different errors we get with a natively resized disk vs. a migrated disk (error outputs above).
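The suggested change could look roughly like the sketch below. Note this is illustrative only: the `Runner` interface, `recordingRunner`, and `remountAsReadonly` are hypothetical stand-ins for the agent's internals, not the real bosh-agent API. The point is simply to swap the unmount-then-mount sequence (which fails with "already mounted or mount point busy" right after an agent restart) for an in-place `-o remount,ro`.

```go
package main

import (
	"fmt"
	"strings"
)

// Runner is a hypothetical stand-in for the agent's command runner.
type Runner interface {
	Run(cmd string, args ...string) error
}

// recordingRunner records commands instead of executing them, for illustration.
type recordingRunner struct{ cmds []string }

func (r *recordingRunner) Run(cmd string, args ...string) error {
	r.cmds = append(r.cmds, cmd+" "+strings.Join(args, " "))
	return nil
}

// remountAsReadonly sketches the proposed fix: remount the already-mounted
// filesystem in place as read-only, instead of unmounting and mounting fresh
// with "-o ro".
func remountAsReadonly(r Runner, device, mountPoint string) error {
	return r.Run("mount", device, mountPoint, "-o", "remount,ro")
}

func main() {
	r := &recordingRunner{}
	_ = remountAsReadonly(r, "/dev/sdc1", "/var/vcap/store")
	fmt.Println(r.cmds[0]) // mount /dev/sdc1 /var/vcap/store -o remount,ro
}
```

This mirrors the manual workaround reported earlier in the thread (`mount <dev> /var/vcap/store -o remount,ro`), which succeeded where the fresh mount failed.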
@beyhan , thanks for clarifying the questions for @Malsourie . Regarding this finding:
> They should provide answers to the questions above. We seem to have an issue with the rotation because the `permanent_nats_credentials` has the value of the feature configuration, which is `true` when the feature is activated, but the table column has inverted logic. @nouseforaname can you confirm this.

and

> From my understanding if we activate the feature `enable_short_lived_nats_bootstrap_credentials`, should column `permanent_nats_credentials` be `false` according to the feature and column name? But indeed they always have the same boolean value. In the code: certificate `{agent_id}.bootstrap.agent.bosh-internal` will be created when a VM is created:
The intention of the column `permanent_nats_credentials` in the VM table is to record, for that particular VM, that the short-lived NATS credentials have been rotated, i.e. whether that VM is now using the permanent credentials or not. This flag is different from the global flag received in the manifest, `enable_short_lived_nats_bootstrap_credentials`, which indicates whether new deployments are going to have the new feature (rotating the short-lived bootstrap credential) enabled or not. Thanks for noticing that this could be ambiguous.

As mentioned by @jpalermo, the problem when remounting a disk is not directly caused by the new feature but by the agent restart it triggers. We are looking at it.
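To make the intended semantics concrete, here is a small illustrative sketch. The struct and function are hypothetical (only the column name and bootstrap user naming come from the thread): the per-VM column answers "has this VM rotated to its permanent credentials?", and nats-sync keeps the short-lived bootstrap user in auth.json only while that is still false.

```go
package main

import "fmt"

// vm models the relevant column of the director's VM table; the struct itself
// is illustrative, not the real model.
type vm struct {
	agentID                  string
	permanentNatsCredentials bool // true once the VM has rotated to permanent creds
}

// includeBootstrapUser sketches the nats-sync decision discussed above: the
// short-lived user {agent_id}.bootstrap.agent.bosh-internal stays in auth.json
// only while the VM has not yet switched to its permanent credentials.
func includeBootstrapUser(v vm) bool {
	return !v.permanentNatsCredentials
}

func main() {
	fresh := vm{agentID: "d09b7f4c", permanentNatsCredentials: false}
	rotated := vm{agentID: "d09b7f4c", permanentNatsCredentials: true}
	fmt.Println(includeBootstrapUser(fresh), includeBootstrapUser(rotated)) // true false
}
```

Read this way, the column is not the inverse of the global `enable_short_lived_nats_bootstrap_credentials` flag; the two answer different questions.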
We tracked the problem down to the `systemctl restart chrony.service` call in the `sync-time` script, which was probably at the very bottom of our "things that might be causing this" list.
We don't really understand "why" this causes the behavior we're seeing, so if anybody has any thoughts please let us know.
Assuming the change doesn't cause problems, it should be in the next version of the Jammy stemcell.
Should we add an extra integration test to our BOSH acceptance tests for this scenario (upgrading VM and disk type in one go)?
@danielfor Thank you for your explanation!
We observed that the issue occurs after we updated Jammy from 1.83 to 1.93. Maybe it is caused by a commit in between.
@Malsourie, can you test with Jammy 1.95? The fix should be included there.
We tested with Jammy 1.95, the problem does not happen anymore.
Hi @jpalermo,
We rolled this new stemcell 1.95 out to our landscapes, but during the migration from 1.93 to 1.95 we again faced an issue with the mounting of a disk.
Task 563876 | 10:49:31 | L executing post-stop: xxxxxxxxxxxxxxx (0) (canary) (00:00:41)
L Error: Action Failed get_task: Task 01f2fb88-4a9d-42b9-7fef-9a9bbac751ec result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/nvme2n1p1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/nvme2n1p1 already mounted or mount point busy.
': exit status 32
Do you have an idea how we can migrate to the new stemcell without running into the issue again?
Hi,

I would go for the `-o remount` option of the `mount` command whenever we do remounting. Since the issue is caused by restarting the agent, it could also happen during the upgrade to Jammy 1.95. We observed: when a VM and disk created on Jammy 1.93 are upgraded to 1.95 while the disk_type changes, we face the same issue again. With the current fix we can only be sure the issue won't happen anymore if the VM and disk are created directly on Jammy 1.95.
We'll see if we can reproduce this in an upgrade situation. That shouldn't matter since the disk migration and mounting/remounting should only be happening on the newly created VM.
This makes me think we're not fully understanding the problem.
We've been unable to reproduce the problem using the following steps:
However, @rkoster pointed me at this PR which was to fix this issue in Bionic. This seems like the exact problem we're seeing here.
This got removed in some of the Chrony cleanup work for Jammy done here.
@ramonskie, not sure if you remember that original work, but it would be good to get your thoughts here
Hi @jpalermo, could you please execute `sv stop agent && sv start agent` on that VM between the step "Deployed a GCP 1.93 VM with a persistent disk" and the step "Changed the size of the disk and updated the stemcell to 1.95", in order to simulate the agent restart caused by the password/certificate rotation?
As we are using the new short lived NATS credentials in our environments, the agent is restarted within the first minute of both VMs being created which we can see in the agent logs:
grep "Starting agent" /var/vcap/bosh/log/current
2023-03-30_21:29:53.68340 [main] 2023/03/30 21:29:53 DEBUG - Starting agent
2023-03-30_21:30:18.70624 [main] 2023/03/30 21:30:18 DEBUG - Starting agent
I did do an extra agent restart on the 1.93 VM, though, just to be sure. So:
Everything seemed to work fine.
@jpalermo I do remember this, yes. It took a long time to find the cause, but it seems @cunnie removed it in this commit: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/commit/248de3dbbdfb1db84cf07f99dcedcc3f6287855d Lucky for us, he is pretty descriptive in his commit messages :)
Closing since all changes have been merged.
Describe the bug
We are trying to resize the vm_type and persistent disk type of a single instance, but got an error message:
Error: Action Failed get_task: Task 9e2690af-bb70-4516-7b1c-fe55ea420740 result: Migrating persistent disk: Remounting persistent disk as readonly: Shelling out to mount: Running command: 'mount /dev/sdc1 /var/vcap/store -o ro', stdout: '', stderr: 'mount: /var/vcap/store: /dev/sdc1 already mounted or mount point busy.
To Reproduce
Steps to reproduce the behavior (example):
Expected behavior
The VM and attached disk are resized.
Logs
Versions (please complete the following information):
Deployment info: manifest: