cloudfoundry / bosh-alicloud-cpi-release

BOSH Alibaba CPI
Apache License 2.0

Attach disk action causing intermittent failures when bosh registry is not present #137

Closed yatzek closed 2 years ago

yatzek commented 3 years ago

After removing the bosh registry from the config, we observe intermittent failures when attaching a persistent disk. We believe the cause is in the CPI code, in the attach disk action, at line 46:

https://github.com/cloudfoundry-incubator/bosh-alicloud-cpi-release/blob/47a510af1b73dea69987425c03033d1394b41d1a/src/bosh-alicloud-cpi/action/attach_disk.go#L46

Lines 46-63 are executed only if the bosh registry is not configured. In that case the CPI stops the vm and then starts it again in lines 92-109.

The vm is started for the first time in the create-vm step:

https://github.com/cloudfoundry-incubator/bosh-alicloud-cpi-release/blob/47a510af1b73dea69987425c03033d1394b41d1a/src/bosh-alicloud-cpi/action/create_vm.go#L356

The bosh agent then starts up and tries to bootstrap its config. Shortly after, during the attach disk step, the CPI stops the vm. This can corrupt bosh agent config file(s). For example, we observed an empty /var/vcap/bosh/agent_state.json, which causes the bosh agent to enter an infinite restart loop:

2021-07-19_14:05:10.39974 [main] 2021/07/19 14:05:10 ERROR - App setup Loading state: Unmarshalling bootstrap state: unexpected end of JSON input
2021-07-19_14:05:10.39974 [main] 2021/07/19 14:05:10 ERROR - Agent exited with error: Loading state: Unmarshalling bootstrap state: unexpected end of JSON input
2021-07-19_14:05:11.40837 [main] 2021/07/19 14:05:11 DEBUG - Starting agent
2021-07-19_14:05:11.40839 [File System] 2021/07/19 14:05:11 DEBUG - Reading file /var/vcap/bosh/agent.json
2021-07-19_14:05:11.40841 [File System] 2021/07/19 14:05:11 DEBUG - Read content
2021-07-19_14:05:11.40841 ********************
2021-07-19_14:05:11.40841 {
2021-07-19_14:05:11.40841   "Platform": {
2021-07-19_14:05:11.40842     "Linux": {
2021-07-19_14:05:11.40842       "PartitionerType": "parted",
2021-07-19_14:05:11.40842       "CreatePartitionIfNoEphemeralDisk": true,
2021-07-19_14:05:11.40842       "DevicePathResolutionType": "virtio"
2021-07-19_14:05:11.40843     }
2021-07-19_14:05:11.40843   },
2021-07-19_14:05:11.40843   "Infrastructure": {
2021-07-19_14:05:11.40844     "Settings": {
2021-07-19_14:05:11.40844       "Sources": [
2021-07-19_14:05:11.40844         {
2021-07-19_14:05:11.40844           "Type": "HTTP",
2021-07-19_14:05:11.40845           "URI": "http://100.100.100.200",
2021-07-19_14:05:11.40845           "UserDataPath": "/latest/user-data",
2021-07-19_14:05:11.40845           "InstanceIDPath": "/latest/meta-data/instance-id",
2021-07-19_14:05:11.40846           "SSHKeysPath": "/latest/meta-data/public-keys/0/openssh-key"
2021-07-19_14:05:11.40846         }
2021-07-19_14:05:11.40846       ],
2021-07-19_14:05:11.40846       "UseServerName": false,
2021-07-19_14:05:11.40847       "UseRegistry": true
2021-07-19_14:05:11.40847     }
2021-07-19_14:05:11.40847   }
2021-07-19_14:05:11.40847 }
2021-07-19_14:05:11.40848 
2021-07-19_14:05:11.40848 ********************
2021-07-19_14:05:11.40848 [File System] 2021/07/19 14:05:11 DEBUG - Reading file /var/vcap/bosh/etc/stemcell_version
2021-07-19_14:05:11.40849 [File System] 2021/07/19 14:05:11 DEBUG - Read content
2021-07-19_14:05:11.40849 ********************
2021-07-19_14:05:11.40849 1.99
2021-07-19_14:05:11.40849 ********************
2021-07-19_14:05:11.40850 [File System] 2021/07/19 14:05:11 DEBUG - Reading file /var/vcap/bosh/etc/stemcell_git_sha1
2021-07-19_14:05:11.40851 [File System] 2021/07/19 14:05:11 DEBUG - Read content
2021-07-19_14:05:11.40851 ********************
2021-07-19_14:05:11.40851 45d72204ed8c7dba77353ff8c2eb53a477372607+
2021-07-19_14:05:11.40852 ********************
2021-07-19_14:05:11.40852 [App] 2021/07/19 14:05:11 INFO - Running on stemcell version '1.99' (git: 45d72204ed8c7dba77353ff8c2eb53a477372607+)
2021-07-19_14:05:11.40852 [File System] 2021/07/19 14:05:11 DEBUG - Checking if file exists /var/vcap/bosh/agent_state.json
2021-07-19_14:05:11.40853 [File System] 2021/07/19 14:05:11 DEBUG - Stat '/var/vcap/bosh/agent_state.json'
2021-07-19_14:05:11.40853 [File System] 2021/07/19 14:05:11 DEBUG - Reading file /var/vcap/bosh/agent_state.json
2021-07-19_14:05:11.40853 [File System] 2021/07/19 14:05:11 DEBUG - Read content
2021-07-19_14:05:11.40853 ********************
2021-07-19_14:05:11.40853 
2021-07-19_14:05:11.40854 ********************
2021-07-19_14:05:11.40854 [main] 2021/07/19 14:05:11 ERROR - App setup Loading state: Unmarshalling bootstrap state: unexpected end of JSON input
2021-07-19_14:05:11.40854 [main] 2021/07/19 14:05:11 ERROR - Agent exited with error: Loading state: Unmarshalling bootstrap state: unexpected end of JSON input
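The final error in the log is what Go's `encoding/json` reports when asked to decode empty input, which is consistent with agent_state.json being empty. A minimal sketch follows. The `bootstrapState` type and `loadState` function are hypothetical stand-ins, not the agent's real types, and the error prefix is copied from the log only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// bootstrapState is a hypothetical placeholder for whatever the agent stores
// in agent_state.json; its real fields are not shown in this issue.
type bootstrapState struct {
	Fields map[string]interface{} `json:"-"`
}

// loadState decodes the file content and wraps any decode error.
func loadState(content []byte) (bootstrapState, error) {
	var state bootstrapState
	if err := json.Unmarshal(content, &state); err != nil {
		return state, fmt.Errorf("Unmarshalling bootstrap state: %w", err)
	}
	return state, nil
}

func main() {
	// An empty file yields zero bytes of content.
	_, err := loadState([]byte(""))
	fmt.Println(err) // Unmarshalling bootstrap state: unexpected end of JSON input
}
```

Because the file exists but cannot be decoded, each restart of the agent hits the same error, which matches the loop seen in the log.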

We also observed empty ssh host keys. These are generated on first boot: https://github.com/cloudfoundry/bosh-linux-stemcell-builder/blob/7320d9060f70deb154e0a9fba688e5c7210bf8d6/stemcell_builder/stages/base_ubuntu_firstboot/assets/root/firstboot.sh#L5

Screenshot 2021-07-20 at 09 24 13

When this happens, you are not able to ssh into the vm.

yatzek commented 3 years ago

This issue is intermittent, with roughly 50% reproducibility. To reproduce it, deploy bosh without the registry, then deploy something that has a persistent disk in a way that causes the vm to be re-created.

xiaozhu36 commented 3 years ago

Hi @yatzek, this issue has been fixed in release v40.0.0. Please check.