cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Rotating CFCR certificates using vSphere without CredHub results in stopped instance due to failure to find CDROM #2213

Closed — simps23 closed this issue 4 years ago

simps23 commented 4 years ago

Describe the bug

Executing the 'bosh deploy ...' command while attempting to rotate and utilize new CFCR certificates using vSphere without CredHub results in the following error:

L Error: Action Failed get_task: Task a4c9b887-6e49-44b1-527a-b697e773ad60 result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found

To Reproduce

Background: Rotating CFCR certificates (kubo_ca, etcd.ca, kubernetes-dashboard-ca) requires a three-step process.

  1. Add new CA to deployment and redeploy the cluster (bosh deploy ...)
    bosh -e my-env -d my-dep manifest > manifest.yml
    bosh -e my-env -d my-dep deploy manifest.yml -o step1-add-new-CA-ops.yml --vars-store creds.yml

step1-add-new-CA-ops.yml

######################
# Variables
######################

#################################
### kubernetes-dashboard-ca
#################################

- type: replace
  path: /variables/-
  value:
    name: kubernetes-dashboard-ca_NEWCA
    options:
      common_name: ca
      is_ca: true
    type: certificate

- type: replace
  path: /variables/-
  value:
    name: tls-kubernetes-dashboard_NEWCA
    type: certificate
    options:
      alternative_names: []
      ca: kubernetes-dashboard-ca_NEWCA
      common_name: kubernetesdashboard.cfcr.internal

######################
# Master
######################

#################################
### apply-specs
#################################

- type: replace
  path: /instance_groups/name=master/jobs/name=apply-specs/properties/tls/kubernetes-dashboard
  value:
    ca: ((tls-kubernetes-dashboard.ca))
    certificate: ((tls-kubernetes-dashboard.certificate))
    private_key: ((tls-kubernetes-dashboard.private_key))

- type: replace
  path: /instance_groups/name=master/jobs/name=apply-specs/properties/tls/kubernetes-dashboard/ca
  value: ((tls-kubernetes-dashboard_NEWCA.ca))((tls-kubernetes-dashboard.ca))
  2. Utilize new ID certificates generated by the new CA and redeploy the cluster (bosh deploy ...)
    bosh deploy -e my-env -d my-dep /tmp/manifest.yml -o step2-utilize-new-ID-certificate-from-new-CA-ops.yml --vars-store creds.yml

Error received:

L Error: Action Failed get_task: Task a4c9b887-6e49-44b1-527a-b697e773ad60 result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found

step2-utilize-new-ID-certificate-from-new-CA-ops.yml

######################
# Master
######################

#################################
### apply-specs
#################################

- type: replace
  path: /instance_groups/name=master/jobs/name=apply-specs/properties/tls/kubernetes-dashboard/certificate
  value: ((tls-kubernetes-dashboard_NEWCA.certificate))

- type: replace
  path: /instance_groups/name=master/jobs/name=apply-specs/properties/tls/kubernetes-dashboard/private_key
  value: ((tls-kubernetes-dashboard_NEWCA.private_key))
  3. Remove old CA and redeploy the cluster (bosh deploy ...)
    • N/A — this step was never reached because step 2 failed (a hypothetical ops-file sketch follows)
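For reference, a minimal sketch of what a step-3 ops file could look like, assuming the same variable names as step 1 — this is hypothetical and was never actually run:

step3-remove-old-CA-ops.yml (hypothetical sketch)

# Point the dashboard TLS property at only the new CA, dropping the old one
- type: replace
  path: /instance_groups/name=master/jobs/name=apply-specs/properties/tls/kubernetes-dashboard/ca
  value: ((tls-kubernetes-dashboard_NEWCA.ca))

# Remove the old CA and old leaf-certificate variables from the manifest
- type: remove
  path: /variables/name=kubernetes-dashboard-ca

- type: remove
  path: /variables/name=tls-kubernetes-dashboard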

Expected behavior

I expect the three-step process to rotate CFCR certificates to complete without error.

Logs

Step 1 Logs

Using environment 'my-env' as client 'admin'

Using deployment 'my-dep'

######################################################## 100.00% 486.33 KiB/s 0s
Task 13                                                                         

Task 13 | 20:39:09 | Extracting release: Extracting release (00:00:00)
Task 13 | 20:39:09 | Verifying manifest: Verifying manifest (00:00:00)
Task 13 | 20:39:09 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 13 | 20:39:09 | Processing 1 existing package: Processing 1 existing package (00:00:00)  
Task 13 | 20:39:09 | Processing 1 existing job: Processing 1 existing job (00:00:00)          
Task 13 | 20:39:09 | Release has been created: csp-conf/1.0.75292 (00:00:00)                  

Task 13 Started  Mon Oct  7 20:39:09 UTC 2019
Task 13 Finished Mon Oct  7 20:39:09 UTC 2019
Task 13 Duration 00:00:00                    
Task 13 done                                 
########################################################## 100.00% 2.08 MiB/s 0s

Task 14

Task 14 | 20:39:30 | Extracting release: Extracting release (00:00:00)
Task 14 | 20:39:30 | Verifying manifest: Verifying manifest (00:00:00)
Task 14 | 20:39:30 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 14 | 20:39:30 | Processing 13 existing jobs: Processing 13 existing jobs (00:00:00)      
Task 14 | 20:39:30 | Release has been created: os-conf/20.0.0 (00:00:00)                      

Task 14 Started  Mon Oct  7 20:39:30 UTC 2019
Task 14 Finished Mon Oct  7 20:39:30 UTC 2019
Task 14 Duration 00:00:00                    
Task 14 done                                 
######################################################## 100.00% 609.21 KiB/s 0s

Task 15

Task 15 | 20:39:55 | Extracting release: Extracting release (00:00:00)
Task 15 | 20:39:56 | Verifying manifest: Verifying manifest (00:00:00)
Task 15 | 20:39:56 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 15 | 20:39:56 | Processing 1 existing package: Processing 1 existing package (00:00:00)  
Task 15 | 20:39:56 | Processing 1 existing job: Processing 1 existing job (00:00:00)          
Task 15 | 20:39:56 | Release has been created: keepalived/1.0.75292 (00:00:00)                

Task 15 Started  Mon Oct  7 20:39:55 UTC 2019
Task 15 Finished Mon Oct  7 20:39:56 UTC 2019
Task 15 Duration 00:00:01                    
Task 15 done                                 
######################################################## 100.00% 715.36 KiB/s 0s

######################################################## 100.00% 481.85 KiB/s 0s

Task 17
Task 16 | 20:40:04 | Extracting release: Extracting release (00:00:00)
Task 16 | 20:40:04 | Verifying manifest: Verifying manifest (00:00:00)
Task 17 | 20:40:04 | Extracting release: Extracting release (00:00:00)
Task 17 | 20:40:04 | Verifying manifest: Verifying manifest (00:00:00)
Task 16 | 20:40:04 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 16 | 20:40:04 | Processing 4 existing jobs: Processing 4 existing jobs (00:00:00)
Task 16 | 20:40:04 | Compiled Release has been created: cfcr-etcd/1.11.1 (00:00:00)

Task 16 Started  Mon Oct  7 20:40:04 UTC 2019
Task 16 Finished Mon Oct  7 20:40:04 UTC 2019
Task 16 Duration 00:00:00
Task 16 done

Task 17 | 20:40:04 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 17 | 20:40:04 | Processing 3 existing jobs: Processing 3 existing jobs (00:00:00)
Task 17 | 20:40:04 | Compiled Release has been created: bpm/1.0.4 (00:00:00)

Task 17 Started  Mon Oct  7 20:40:04 UTC 2019
Task 17 Finished Mon Oct  7 20:40:04 UTC 2019
Task 17 Duration 00:00:00
Task 17 done
########################################################## 100.00% 3.19 MiB/s 0s

Task 18

Task 18 | 20:40:34 | Extracting release: Extracting release (00:00:00)
Task 18 | 20:40:34 | Verifying manifest: Verifying manifest (00:00:00)
Task 18 | 20:40:34 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 18 | 20:40:34 | Processing 6 existing jobs: Processing 6 existing jobs (00:00:00)
Task 18 | 20:40:34 | Compiled Release has been created: docker/35.2.1 (00:00:00)

Task 18 Started  Mon Oct  7 20:40:34 UTC 2019
Task 18 Finished Mon Oct  7 20:40:34 UTC 2019
Task 18 Duration 00:00:00
Task 18 done
########################################################## 100.00% 5.04 MiB/s 0s

Task 19

Task 19 | 20:40:37 | Extracting release: Extracting release (00:00:00)
Task 19 | 20:40:37 | Verifying manifest: Verifying manifest (00:00:00)
Task 19 | 20:40:37 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 19 | 20:40:37 | Processing 14 existing jobs: Processing 14 existing jobs (00:00:00)
Task 19 | 20:40:37 | Compiled Release has been created: kubo/0.32.0 (00:00:00)

Task 19 Started  Mon Oct  7 20:40:37 UTC 2019
Task 19 Finished Mon Oct  7 20:40:37 UTC 2019
Task 19 Duration 00:00:00
Task 19 done
  instance_groups:
  - name: master
    jobs:
    - name: apply-specs
      properties:
        tls:
          kubernetes-dashboard:
-           ca: "<redacted>"
+           ca: "<redacted>"

Continue? [yN]: y

Task 20

Task 20 | 20:41:36 | Preparing deployment: Preparing deployment (00:00:01)
Task 20 | 20:41:37 | Preparing deployment: Rendering templates (00:00:01)
Task 20 | 20:41:39 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 20 | 20:41:39 | Updating instance master: master/bfdbe0c8-19ee-4077-a5a5-dde20b98a067 (0) (canary) (00:02:19)

Task 20 Started  Mon Oct  7 20:41:36 UTC 2019
Task 20 Finished Mon Oct  7 20:43:58 UTC 2019
Task 20 Duration 00:02:22
Task 20 done

Succeeded

Step 2 Logs

Using environment 'my-env' as client 'admin'                                                                                              

Using deployment 'my-dep'

############################################################# 100.00% 618 B/s 3s
Task 22                                                                         

Task 22 | 20:49:08 | Extracting release: Extracting release (00:00:00)
Task 22 | 20:49:08 | Verifying manifest: Verifying manifest (00:00:00)
Task 22 | 20:49:08 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 22 | 20:49:08 | Processing 1 existing package: Processing 1 existing package (00:00:00)  
Task 22 | 20:49:08 | Processing 1 existing job: Processing 1 existing job (00:00:00)          
Task 22 | 20:49:08 | Release has been created: csp-conf/1.0.75292 (00:00:00)                  

Task 22 Started  Mon Oct  7 20:49:08 UTC 2019
Task 22 Finished Mon Oct  7 20:49:08 UTC 2019
Task 22 Duration 00:00:00                    
Task 22 done                                 
########################################################## 100.00% 1.92 MiB/s 0s

Task 23

Task 23 | 20:49:36 | Extracting release: Extracting release (00:00:00)
Task 23 | 20:49:36 | Verifying manifest: Verifying manifest (00:00:00)
Task 23 | 20:49:37 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 23 | 20:49:37 | Processing 13 existing jobs: Processing 13 existing jobs (00:00:00)      
Task 23 | 20:49:37 | Release has been created: os-conf/20.0.0 (00:00:00)                      

Task 23 Started  Mon Oct  7 20:49:36 UTC 2019
Task 23 Finished Mon Oct  7 20:49:37 UTC 2019
Task 23 Duration 00:00:01                    
Task 23 done                                 
######################################################## 100.00% 613.56 KiB/s 0s

Task 24

Task 24 | 20:49:42 | Extracting release: Extracting release (00:00:00)
Task 24 | 20:49:42 | Verifying manifest: Verifying manifest (00:00:00)
Task 24 | 20:49:42 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 24 | 20:49:42 | Processing 1 existing package: Processing 1 existing package (00:00:00)  
Task 24 | 20:49:42 | Processing 1 existing job: Processing 1 existing job (00:00:00)          
Task 24 | 20:49:42 | Release has been created: keepalived/1.0.75292 (00:00:00)                

Task 24 Started  Mon Oct  7 20:49:42 UTC 2019
Task 24 Finished Mon Oct  7 20:49:42 UTC 2019
Task 24 Duration 00:00:00                    
Task 24 done                                 
######################################################## 100.00% 484.04 KiB/s 0s
########################################################## 100.00% 1.04 MiB/s 0s

Task 25
Task 26
Task 25 | 20:49:43 | Extracting release: Extracting release (00:00:00)
Task 25 | 20:49:43 | Verifying manifest: Verifying manifest (00:00:00)
Task 25 | 20:49:43 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 25 | 20:49:43 | Processing 3 existing jobs: Processing 3 existing jobs (00:00:00)        
Task 25 | 20:49:43 | Compiled Release has been created: bpm/1.0.4 (00:00:00)                  

Task 25 Started  Mon Oct  7 20:49:43 UTC 2019
Task 25 Finished Mon Oct  7 20:49:43 UTC 2019
Task 25 Duration 00:00:00
Task 25 done

Task 26 | 20:49:44 | Extracting release: Extracting release (00:00:00)
Task 26 | 20:49:44 | Verifying manifest: Verifying manifest (00:00:00)
Task 26 | 20:49:44 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 26 | 20:49:44 | Processing 4 existing jobs: Processing 4 existing jobs (00:00:00)
Task 26 | 20:49:44 | Compiled Release has been created: cfcr-etcd/1.11.1 (00:00:00)

Task 26 Started  Mon Oct  7 20:49:44 UTC 2019
Task 26 Finished Mon Oct  7 20:49:44 UTC 2019
Task 26 Duration 00:00:00
Task 26 done
########################################################## 100.00% 2.48 MiB/s 0s

Task 27

Task 27 | 20:49:47 | Extracting release: Extracting release (00:00:00)
Task 27 | 20:49:47 | Verifying manifest: Verifying manifest (00:00:00)
Task 27 | 20:49:47 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 27 | 20:49:48 | Processing 6 existing jobs: Processing 6 existing jobs (00:00:00)
Task 27 | 20:49:48 | Compiled Release has been created: docker/35.2.1 (00:00:00)

Task 27 Started  Mon Oct  7 20:49:47 UTC 2019
Task 27 Finished Mon Oct  7 20:49:48 UTC 2019
Task 27 Duration 00:00:01
Task 27 done
########################################################## 100.00% 5.63 MiB/s 0s

Task 28

Task 28 | 20:49:52 | Extracting release: Extracting release (00:00:00)
Task 28 | 20:49:52 | Verifying manifest: Verifying manifest (00:00:00)
Task 28 | 20:49:52 | Resolving package dependencies: Resolving package dependencies (00:00:00)
Task 28 | 20:49:52 | Processing 14 existing jobs: Processing 14 existing jobs (00:00:00)
Task 28 | 20:49:52 | Compiled Release has been created: kubo/0.32.0 (00:00:00)

Task 28 Started  Mon Oct  7 20:49:52 UTC 2019
Task 28 Finished Mon Oct  7 20:49:52 UTC 2019
Task 28 Duration 00:00:00
Task 28 done
  instance_groups:
  - name: master
    jobs:
    - name: apply-specs
      properties:
        tls:
          kubernetes-dashboard:
-           certificate: "<redacted>"
+           certificate: "<redacted>"
-           private_key: "<redacted>"
+           private_key: "<redacted>"

Continue? [yN]: y

Task 29

Task 29 | 20:50:29 | Preparing deployment: Preparing deployment (00:00:01)
Task 29 | 20:50:30 | Preparing deployment: Rendering templates (00:00:02)
Task 29 | 20:50:32 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 29 | 20:50:32 | Updating instance master: master/bfdbe0c8-19ee-4077-a5a5-dde20b98a067 (0) (canary) (00:02:15)
                   L Error: Action Failed get_task: Task 871354b6-0c16-4b3d-4ec3-ec5ee3f90e5e result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
Task 29 | 20:52:47 | Error: Action Failed get_task: Task 871354b6-0c16-4b3d-4ec3-ec5ee3f90e5e result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found

Task 29 Started  Mon Oct  7 20:50:29 UTC 2019
Task 29 Finished Mon Oct  7 20:52:47 UTC 2019
Task 29 Duration 00:02:18
Task 29 error

Updating deployment:
  Expected task '29' to succeed but state is 'error'

Exit code 1


Additional context

To state again, my goal is to rotate ALL of the CFCR certificates.

For simplicity, this issue and its details describe only kubernetes-dashboard-ca, because it has a single CA and a single leaf ID certificate.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/168999860

The labels on this github issue will be updated when the story is started.

h4xnoodle commented 4 years ago

Hi @simps23,

Thanks for filing this issue.

What was the result of 'Remove old CA and redeploy the cluster (bosh deploy ...)', and what ops file was used? Was NEWCA the only CA?

Could you please supply the output of 'bosh task 29 --debug'? This may contain secrets, so feel free to Slack DM me that log.

So far it doesn't seem that the CA change and the vsphere CPI error are related. The CA change for deployed software should not affect the CPI's ability to locate and use the CDROM. Usually when this error occurs, the CDROM has been disconnected.

Is the CDROM detached/disconnected in the IaaS? One thing to try may be to bosh recreate one of the failing master instances and see if the problem persists. If the CDROM was accidentally disconnected, then you may have trouble recreating as well. If so, --fix is useful, but it will not run drain on the jobs.
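For example (a sketch; the environment/deployment aliases match the my-env/my-dep names used above, and the instance ID is the master from the logs):

# Pull the full debug log for the failed task
bosh -e my-env task 29 --debug > task29-debug.log

# Recreate the failing master instance and see if the problem persists
bosh -e my-env -d my-dep recreate master/bfdbe0c8-19ee-4077-a5a5-dde20b98a067

# If the agent is unresponsive, --fix recreates anyway, but skips drain
bosh -e my-env -d my-dep recreate master/bfdbe0c8-19ee-4077-a5a5-dde20b98a067 --fix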

Please let me know what you find.

Rebecca, BOSH

simps23 commented 4 years ago

Rebecca,

The 'Remove old CA' step was not executed, as I stopped when the 2nd/previous step failed to complete. The K8s nodes and BOSH instances are unreachable/stopped. Are you asking because you think that if I just continue and remove the old CA, things 'might fix themselves'? I can try that, but I believe I executed this in the past without realizing step 2 was failing, and the cluster was left in a bad state with stopped/unreachable nodes.

After a clean/good installation, the CDROM is disconnected when I look at the vSphere IaaS. I was not sure if this was normal or expected.

Below I am providing logs created during my initial cluster installation. Note that the 'CDROM' errors are ALSO SEEN at this time, and the CDROM is disconnected at this point. I HAVE NOT ATTEMPTED CERTIFICATE ROTATION YET.

Note that the server running BOSH CLI is ~6 hours behind the Master Node. (10:00:00 BOSH CLI Server time = 16:00:00 Master Node Time)

BOSH CLI Server Logs during initial cluster installation (TO NOTE TIMESTAMP OF MASTER NODE CREATION)

[13290 05536 :058392: ftc   10/08/2019 10:08:44 -0600 INFO    ] STDOUT - Task 9 | 16:08:42 | Compiled Release has been created: kubo/0.32.0 (00:00:00)
[13290 05536 :058393: ftc   10/08/2019 10:08:45 -0600 INFO    ] STDOUT - Task 10 | 16:08:44 | Preparing deployment: Preparing deployment (00:00:01)
[13290 05536 :058394: ftc   10/08/2019 10:08:47 -0600 INFO    ] STDOUT - Task 10 | 16:08:45 | Preparing deployment: Rendering templates (00:00:01)
[13290 05536 :058395: ftc   10/08/2019 10:08:47 -0600 INFO    ] STDOUT - Task 10 | 16:08:46 | Preparing package compilation: Finding packages to compile (00:00:01)
[13290 05536 :058396: ftc   10/08/2019 10:11:19 -0600 INFO    ] STDOUT - Task 10 | 16:08:47 | Compiling packages: keepalived/85fbf94be333316a41ae19959e6807f4dcc60c94 (00:02:31)
[13290 05536 :058397: ftc   10/08/2019 10:12:17 -0600 INFO    ] STDOUT - Task 10 | 16:11:18 | Compiling packages: csp-conf/38a9cc92de4fcdb3f12fc9630095a0c6a7b99dcd (00:00:29)
[13290 05536 :058398: ftc   10/08/2019 10:12:17 -0600 INFO    ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0)
[13290 05536 :058399: ftc   10/08/2019 10:12:17 -0600 INFO    ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1)
[13290 05536 :058400: ftc   10/08/2019 10:14:16 -0600 INFO    ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: worker/3fcfa3e2-59c1-4c9e-8ca1-9b041bb2a3c5 (0) (00:01:58)
[13290 05536 :058401: ftc   10/08/2019 10:14:18 -0600 INFO    ] STDOUT - Task 10 | 16:14:16 | Creating missing vms: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0) (00:01:59)
[13290 05536 :058402: ftc   10/08/2019 10:14:18 -0600 INFO    ] STDOUT - Task 10 | 16:14:17 | Creating missing vms: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1) (00:02:00)
[13290 05536 :058403: ftc   10/08/2019 10:18:11 -0600 INFO    ] STDOUT - Task 10 | 16:14:18 | Updating instance master: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0) (canary) (00:03:53)
[13290 05536 :058404: ftc   10/08/2019 10:19:18 -0600 INFO    ] STDOUT - Task 10 | 16:18:11 | Updating instance worker: worker/3fcfa3e2-59c1-4c9e-8ca1-9b041bb2a3c5 (0) (canary) (00:01:07)
[13290 05536 :058405: ftc   10/08/2019 10:21:14 -0600 INFO    ] STDOUT - Task 10 | 16:19:18 | Updating instance worker: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1) (00:01:55)

Master Node BOSH Logs (NOTE ERRORS ARE DURING OR CLOSE TO INITIAL INSTALLATION)

master/de8ded8f-526a-4f15-b7c2-687770f2f606:/var/vcap/bosh/log# grep CDROM *
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:36.02046           "Type": "CDROM",
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:39.48018 [cdUtil] 2019/10/08 16:13:39 DEBUG - Umounting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:39.49600 [cdUtil] 2019/10/08 16:13:39 DEBUG - Ejecting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:14:12.94900 [settingsService] 2019/10/08 16:14:12 ERROR - Failed loading settings via fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:09.32348 [cdUtil] 2019/10/08 16:16:09 DEBUG - Umounting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:09.33188 [cdUtil] 2019/10/08 16:16:09 DEBUG - Ejecting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:16.20379 [settingsService] 2019/10/08 16:16:16 ERROR - Failed loading settings via fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found

This is a development system. Let me know if/how I should attach or provide log @400000005d9cb65e1b33a174.s for you to analyze.

I have tried to recreate, manually attach the CDROM via vCenter, restart nodes, etc., and can sometimes get a cluster back into a usable state... but it is VERY 'hit or miss'. I am also unsure whether the nodes are actually using the new certificates I provide (maybe they were brought back using the 'last successful state')... and I cannot accept a solution that does not run 'drain' on the jobs.
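For scripting those manual vCenter steps, a tool like govc can inspect and reconnect the device (a sketch; the inventory path and device name are assumptions and will differ per environment):

# List virtual devices on the master VM to find the CDROM device name
govc device.ls -vm /my-dc/vm/<master-vm-cid>

# Show whether that CDROM device is currently connected
govc device.info -vm /my-dc/vm/<master-vm-cid> cdrom-3000

# Reconnect a disconnected CDROM device
govc device.connect -vm /my-dc/vm/<master-vm-cid> cdrom-3000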

When I wrote this issue I didn't notice that the CDROM errors also occurred during initial deployment, because I was focused on the CDROM errors occurring when I tried to rotate the cluster certificates... and I wasn't keeping track of timestamps when I had to redeploy the cluster. Are you saying that it is NOT NORMAL for the CDROM to be disconnected during initial installation? What kind of bad things could happen (other than what I'm experiencing during certificate rotation)? Are there other use cases where the CDROM should have stayed connected?

FYI, I am being told by others on my team that it 'should be expected' that the CDROM is ejected and unmounted after initial installation, and that vSphere will have performance issues if CDROMs stay connected after cluster deployment is completed.

h4xnoodle commented 4 years ago

Hi @simps23,

Thanks for the extra information.

From this so far, it seems there may be an issue with how the vsphere CPI and the centos OS are interacting. Support for centos on vsphere is very limited. The bosh team publishes centos stemcells, but we aren't hugely familiar with them. Maybe someone in the community has more experience and can help more.

Would you be able to try using a different OS? Does ubuntu-xenial have the same problem for you? Have you confirmed that no manual actions were taken in the vsphere console that could be interfering with your deployment? This is a common problem when using bosh.

When I wrote this issue I didn't notice the CDROM errors were also during initial deployment

This tells me that it's a consistent issue, and the cert rotation is irrelevant. Can you deploy anything successfully with your combo of IaaS x centos?

Are you saying that it is NOT NORMAL for the CDROM to be disconnected during initial installation

The CDROM is used to bootstrap instances, so it is critical that it be attached during installation. It's up to the CPI whether it needs to be detached or not. In theory it can be disconnected, but if the VM is rebooted there could be issues. The error message implies to me that the CDROM needed to be attached when it may not have been, or that the CPI has trouble finding it. On a vm with a CDROM attached, what does a tool like lsblk say? Is the CDROM at the location the error suggests?

Thanks, Rebecca

simps23 commented 4 years ago

Using a different OS is not an option at this point, even for testing; we do not have the time or resources for that testing, or for such a change even if it worked.

No manual actions were taken on vSphere. This happens during every deployment that I have looked at since realizing these logs happen during initial installation.

When you ask 'Can you deploy anything successfully with your combo?', yes, we can deploy services into the cluster.

lsblk results after successful deployment (and no other actions)

master/afcb8ab7-735d-4378-9d8f-b48b72a9c5ad:/var/vcap/bosh_ssh/bosh_a16f6aacc2ad4c3# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0    3G  0 disk
└─sda1   8:1    0    3G  0 part /
sdb      8:16   0  100G  0 disk
├─sdb1   8:17   0 15.5G  0 part
└─sdb2   8:18   0 84.5G  0 part /var/vcap/data
sdc      8:32   0    5G  0 disk
└─sdc1   8:33   0    5G  0 part /var/vcap/store
sdd      8:48   0    2G  0 disk /var/vcap/data/kubelet/pods/396e7e8e-e9eb-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sde      8:64   0    1G  0 disk /var/vcap/data/kubelet/pods/3974092d-e9eb-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdf      8:80   0    1G  0 disk /var/vcap/data/kubelet/pods/7a856f17-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdg      8:96   0   35G  0 disk /var/vcap/data/kubelet/pods/7ada894e-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdh      8:112  0   25G  0 disk /var/vcap/data/kubelet/pods/9194f4ef-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdi      8:128  0    5G  0 disk /var/vcap/data/kubelet/pods/91c4a2fc-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdj      8:144  0    5G  0 disk /var/vcap/data/kubelet/pods/91c4a2fc-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sr0     11:0    1   52K  0 rom
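As a quick cross-check of what the agent's error path sees (a sketch; assumes shell access on the master and that /dev/sr0 is the device named in the error):

# Ask udev what it knows about the CDROM device the agent reads
udevadm info --query=property --name=/dev/sr0

# Check whether a readable medium is actually present
sudo blkid /dev/sr0 || echo 'no medium in /dev/sr0'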

What are the next actions you suggest I, or someone on your end, take?

h4xnoodle commented 4 years ago

I typed a lot of this out in a private message between us, so I am pasting/editing it here for context and the historical record:

Looking at the logs and the previous logs provided [in DM], it really looks to me like:

  1. CDROM is detached for some reason
  2. Updating an instance causes the bosh-agent to refresh its settings, which fails with: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found

https://github.com/cloudfoundry/bosh-vsphere-cpi-release/blob/master/src/vsphere_cpi/lib/cloud/vsphere/agent_env.rb#L35 shows that the vsphere cpi disconnects the cdrom before updating it. Perhaps this failed at some point and the drive wasn't reconnected. The CPI logs on the director VM will be helpful for that.
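For example, the CPI request/response pairs for the failed update show up in the task debug log, and the director's own logs live on the director VM (a sketch; the paths are the usual defaults and may differ per deployment):

# CPI calls (and CDROM-related errors) recorded for the failed task
bosh -e my-env task 29 --debug | grep -i -E 'cpi|cdrom'

# On the director VM, director/CPI logs typically live under:
ls /var/vcap/sys/log/director/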

If you have access to the audit logs in the IaaS console for the problematic vms, it may show 'who' is disconnecting the disk, if that is happening.

The fact this VM is deployed already, and you're just updating it, implies that it worked fine at one point.

The agent ejects the CDROM after settings are fetched: https://github.com/cloudfoundry/bosh-agent/blob/bdfafc66300b442adecd691464656b8a5b7951e4/platform/cdrom/cdutil.go#L37

So far, then:

  1. Deployed a VM successfully [cdrom was attached, read successfully, then ejected]
  2. Attempted to rotate certs for a deployment; updating a vm fails with the above error [cdrom doesn't seem to be attached]
  3. The CPI should be updating the agent settings and then presumably attaching the cdrom again [vsphere CPI]
  4. The agent will then again look for the CD, read the settings, and then eject.

It would seem that in step 3 there could have been an issue with the CDROM being connected again. The vsphere cpi logs are still a great place to look, in addition to the vsphere audit logs, to determine any other CDROM disconnects.
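The vSphere event stream for the VM is one way to see who (or what) reconfigured the device (a sketch; the inventory path is an assumption):

# Recent vSphere events for the problematic VM, including device reconfigures
govc events -n 200 /my-dc/vm/<master-vm-cid>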

h4xnoodle commented 4 years ago

Closing as this issue is known and has a workaround.