Closed simps23 closed 4 years ago
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/168999860
The labels on this github issue will be updated when the story is started.
Hi @simps23,
Thanks for filing this issue.
What was the result of the 'Remove old CA and redeploy the cluster (bosh deploy ...)' step, and which ops file was used? Was the NEWCA the only CA at that point?
Could you please supply the output of 'bosh task 29 --debug'? This may contain secrets, so feel free to DM me the log on Slack.
So far it doesn't seem that the CA change and the vSphere CPI error are related. A CA change for deployed software should not affect the CPI's ability to locate and use the CDROM. Usually when this error occurs, the CDROM has been disconnected.
Is the CDROM detached/disconnected in the IaaS? One thing to try may be to 'bosh recreate' one of the failing master instances and see if the problem persists. If the CDROM was accidentally disconnected, you may have trouble recreating as well. If so, '--fix' is useful, but it will not run 'drain' on the jobs.
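A sketch of those recovery commands, assuming an environment alias 'my-env', a deployment named 'my-dep', and a placeholder master instance ID (all three are assumptions, not values from this issue):

```shell
# Recreate a failing master instance (placeholder instance ID).
bosh -e my-env -d my-dep recreate master/<instance-id>

# If the agent is unresponsive (for example because the CDROM was
# disconnected), --fix lets the recreate proceed, but drain scripts
# on the jobs will not run.
bosh -e my-env -d my-dep recreate master/<instance-id> --fix
```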
Please let me know what you find.
Rebecca, BOSH
Rebecca,
The 'Remove old CA' step was not executed, because I stopped when the 2nd/previous step failed to complete. The K8s nodes and BOSH instances are unreachable/stopped. Are you asking because you think that if I just continue and remove the old CA, things 'might fix themselves'? I can try that, but I believe I did this in the past, before I realized step 2 was failing, and the cluster ended up in a bad state with stopped/unreachable nodes.
After a clean/good installation, the CDROM is disconnected when I look at the vSphere IaaS. I was not sure if this was normal or expected.
Below I am providing logs created during my initial cluster installation. Note that the 'CDROM' errors are ALSO SEEN at this time and that the CDROM is disconnected at this time. I HAVE NOT ATTEMPTED CERTIFICATE ROTATION YET.
Note that the server running BOSH CLI is ~6 hours behind the Master Node. (10:00:00 BOSH CLI Server time = 16:00:00 Master Node Time)
BOSH CLI Server Logs during initial cluster installation (TO NOTE TIMESTAMP OF MASTER NODE CREATION)
[13290 05536 :058392: ftc 10/08/2019 10:08:44 -0600 INFO ] STDOUT - Task 9 | 16:08:42 | Compiled Release has been created: kubo/0.32.0 (00:00:00)
[13290 05536 :058393: ftc 10/08/2019 10:08:45 -0600 INFO ] STDOUT - Task 10 | 16:08:44 | Preparing deployment: Preparing deployment (00:00:01)
[13290 05536 :058394: ftc 10/08/2019 10:08:47 -0600 INFO ] STDOUT - Task 10 | 16:08:45 | Preparing deployment: Rendering templates (00:00:01)
[13290 05536 :058395: ftc 10/08/2019 10:08:47 -0600 INFO ] STDOUT - Task 10 | 16:08:46 | Preparing package compilation: Finding packages to compile (00:00:01)
[13290 05536 :058396: ftc 10/08/2019 10:11:19 -0600 INFO ] STDOUT - Task 10 | 16:08:47 | Compiling packages: keepalived/85fbf94be333316a41ae19959e6807f4dcc60c94 (00:02:31)
[13290 05536 :058397: ftc 10/08/2019 10:12:17 -0600 INFO ] STDOUT - Task 10 | 16:11:18 | Compiling packages: csp-conf/38a9cc92de4fcdb3f12fc9630095a0c6a7b99dcd (00:00:29)
[13290 05536 :058398: ftc 10/08/2019 10:12:17 -0600 INFO ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0)
[13290 05536 :058399: ftc 10/08/2019 10:12:17 -0600 INFO ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1)
[13290 05536 :058400: ftc 10/08/2019 10:14:16 -0600 INFO ] STDOUT - Task 10 | 16:12:17 | Creating missing vms: worker/3fcfa3e2-59c1-4c9e-8ca1-9b041bb2a3c5 (0) (00:01:58)
[13290 05536 :058401: ftc 10/08/2019 10:14:18 -0600 INFO ] STDOUT - Task 10 | 16:14:16 | Creating missing vms: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0) (00:01:59)
[13290 05536 :058402: ftc 10/08/2019 10:14:18 -0600 INFO ] STDOUT - Task 10 | 16:14:17 | Creating missing vms: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1) (00:02:00)
[13290 05536 :058403: ftc 10/08/2019 10:18:11 -0600 INFO ] STDOUT - Task 10 | 16:14:18 | Updating instance master: master/de8ded8f-526a-4f15-b7c2-687770f2f606 (0) (canary) (00:03:53)
[13290 05536 :058404: ftc 10/08/2019 10:19:18 -0600 INFO ] STDOUT - Task 10 | 16:18:11 | Updating instance worker: worker/3fcfa3e2-59c1-4c9e-8ca1-9b041bb2a3c5 (0) (canary) (00:01:07)
[13290 05536 :058405: ftc 10/08/2019 10:21:14 -0600 INFO ] STDOUT - Task 10 | 16:19:18 | Updating instance worker: worker/50a1d1c4-613f-4ff3-90fa-bd85319dcf78 (1) (00:01:55)
Master Node BOSH Logs (NOTE ERRORS ARE DURING OR CLOSE TO INITIAL INSTALLATION)
master/de8ded8f-526a-4f15-b7c2-687770f2f606:/var/vcap/bosh/log# grep CDROM *
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:36.02046 "Type": "CDROM",
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:39.48018 [cdUtil] 2019/10/08 16:13:39 DEBUG - Umounting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:13:39.49600 [cdUtil] 2019/10/08 16:13:39 DEBUG - Ejecting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:14:12.94900 [settingsService] 2019/10/08 16:14:12 ERROR - Failed loading settings via fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:09.32348 [cdUtil] 2019/10/08 16:16:09 DEBUG - Umounting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:09.33188 [cdUtil] 2019/10/08 16:16:09 DEBUG - Ejecting CDROM
@400000005d9cb65e1b33a174.s:2019-10-08_16:16:16.20379 [settingsService] 2019/10/08 16:16:16 ERROR - Failed loading settings via fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
This is a development system. Let me know if/how I should attach or provide log @400000005d9cb65e1b33a174.s for you to analyze.
I have tried to recreate, manually attach the CDROM via vCenter, restart nodes, etc., and can sometimes get a cluster back into a usable state... but it is VERY 'hit or miss'. I am also unsure whether they are actually using the new certificates I provide (maybe the nodes were brought back using the 'last successful state')... and I cannot accept a solution that does not run 'drain' on the jobs.
When I wrote this issue I didn't notice the CDROM errors were also during initial deployment, because I was focused on the CDROM errors occurring when I tried to rotate the cluster certificates... and I wasn't keeping track of timestamps when I had to redeploy the cluster. Are you saying that it is NOT NORMAL for the CDROM to be disconnected during initial installation? What kind of bad things could happen (other than what I'm experiencing during certificate rotation)? Are there other use cases where the CDROM should have stayed connected?
FYI, I am being told by others on my team that it 'should be expected' that the CDROM is ejected and unmounted after initial installation. vSphere will have performance issues if CDROMs stay connected after cluster deployment is completed.
Hi @simps23,
Thanks for the extra information.
From this so far, it seems there may be an issue with how the vSphere CPI and the CentOS OS are interacting. Support for CentOS on vSphere is very limited. The BOSH team publishes CentOS, but we aren't hugely familiar with it. Maybe someone in the community has more experience and can help more.
Would you be able to try using a different OS? Does ubuntu-xenial have the same problem for you? Have you confirmed that no manual actions were taken in the vSphere console that could be interfering with your deployment? Manual changes are a common source of problems when using BOSH.
When I wrote this issue I didn't notice the CDROM errors were also during initial deployment
This tells me that it's a consistent issue and the cert rotation is irrelevant. Can you deploy anything successfully with your combination of IaaS and CentOS?
Are you saying that it is NOT NORMAL for the CDROM to be disconnected during initial installation
The CDROM is used to bootstrap instances, so it is critical that it be attached during installation. It's up to the CPI whether it needs to be detached or not. In theory it can be disconnected, but if the VM is rebooted there could be issues. The error message implies to me that the CDROM needed to be attached when it may not have been, or that the CPI has trouble finding it. On a VM with a CDROM attached, what does a tool like 'lsblk' say? Is the CDROM at the location the error suggests?
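On the VM, a quick check like the following sketch can tell whether the device the agent is reading exists at all. The /dev/sr0 path comes from the error message in this issue; 'check_cdrom' is a hypothetical helper, not part of BOSH:

```shell
#!/bin/sh
# Report whether the BOSH agent's settings CDROM device is present.
# The /dev/sr0 default comes from the error in this issue.
check_cdrom() {
  device=${1:-/dev/sr0}
  if [ -b "$device" ]; then
    echo "block device present: $device"
    lsblk "$device"
  else
    echo "no block device at $device"
  fi
}

check_cdrom "$@"
```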
Thanks, Rebecca
Using a different OS is not an option at this point, even for testing; we do not have the time or resources for the testing, or for a change even if it worked.
No manual actions were taken on vSphere. This happens during every deployment that I have looked at since realizing these logs happen during initial installation.
When you ask 'Can you deploy anything successfully with your combo?': yes, we can deploy services into the cluster.
lsblk results after successful deployment (and no other actions)
master/afcb8ab7-735d-4378-9d8f-b48b72a9c5ad:/var/vcap/bosh_ssh/bosh_a16f6aacc2ad4c3# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 3G 0 disk
└─sda1 8:1 0 3G 0 part /
sdb 8:16 0 100G 0 disk
├─sdb1 8:17 0 15.5G 0 part
└─sdb2 8:18 0 84.5G 0 part /var/vcap/data
sdc 8:32 0 5G 0 disk
└─sdc1 8:33 0 5G 0 part /var/vcap/store
sdd 8:48 0 2G 0 disk /var/vcap/data/kubelet/pods/396e7e8e-e9eb-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sde 8:64 0 1G 0 disk /var/vcap/data/kubelet/pods/3974092d-e9eb-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdf 8:80 0 1G 0 disk /var/vcap/data/kubelet/pods/7a856f17-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdg 8:96 0 35G 0 disk /var/vcap/data/kubelet/pods/7ada894e-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdh 8:112 0 25G 0 disk /var/vcap/data/kubelet/pods/9194f4ef-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdi 8:128 0 5G 0 disk /var/vcap/data/kubelet/pods/91c4a2fc-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sdj 8:144 0 5G 0 disk /var/vcap/data/kubelet/pods/91c4a2fc-e9ec-11e9-a8b1-005056be7d56/volumes/kubernetes.io~vspher
sr0 11:0 1 52K 0 rom
What are the next actions you suggest I, or someone on your end, take?
I typed a lot of this out in a private message between us, so pasting/editing here for context/historical:
Looking at the logs and the previous logs provided [in DM], here is my read on this error:
Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
1. https://github.com/cloudfoundry/bosh-vsphere-cpi-release/blob/master/src/vsphere_cpi/lib/cloud/vsphere/agent_env.rb#L35 shows that the vSphere CPI disconnects the CDROM before updating it. Perhaps this failed at some point and the drive wasn't reconnected. The CPI logs on the director VM will be helpful for that.
2. The fact that this VM is deployed already, and you're just updating it, implies that it worked fine at one point.
3. The agent ejects the CDROM after settings are fetched: https://github.com/cloudfoundry/bosh-agent/blob/bdfafc66300b442adecd691464656b8a5b7951e4/platform/cdrom/cdutil.go#L37
If you have access to the audit logs in the IaaS console for the problematic VMs, they may show 'who' is disconnecting the disk, if that is happening.
So far, then:
It would seem that in 3 there could potentially have been an issue with the CDROM being connected again. The vSphere CPI logs are still a great place to look, in addition to the vSphere audit logs, to determine whether there were any other CDROM disconnects.
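To pull those CPI logs, the bosh CLI's task log flags are one route. A sketch, assuming the task number 29 mentioned earlier in this thread and an environment alias of 'my-env' (the alias is an assumption):

```shell
# Full debug log for the failing task; this includes the CPI
# request/response traffic the director generated.
bosh -e my-env task 29 --debug > task-29-debug.log

# The bosh CLI can also filter a task's log down to just the CPI calls.
bosh -e my-env task 29 --cpi > task-29-cpi.log
```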
Closing as this issue is known and has a workaround.
Describe the bug Executing the 'bosh deploy ...' command while attempting to rotate and utilize new CFCR certificates using vSphere without CredHub results in the following error:
Error: Action Failed get_task: Task a4c9b887-6e49-44b1-527a-b697e773ad60 result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
To Reproduce
Background: Rotating the CFCR certificates (kubo_ca, etcd.ca, kubernetes-dashboard-ca) requires a 3-step process.
step1-add-new-CA-ops.yml
bosh deploy -e my-env -d my-dep /tmp/manifest.yml -o step2-utilize-new-ID-certificate-from-new-CA-ops.yml --vars-store creds.yml
ERROR RECEIVED
Error: Action Failed get_task: Task a4c9b887-6e49-44b1-527a-b697e773ad60 result: Refreshing the settings: Invoking settings fetcher: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
step2-utilize-new-ID-certificate-from-new-CA-ops.yml
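The steps above can be sketched end to end as follows. This is a sketch, not verbatim from the issue: the environment name, deployment name, and manifest path mirror the step 2 command shown, and step 3 is left as a comment because this issue does not show its ops file or command.

```shell
# Sketch of the 3-step CA rotation flow described above.
# 'my-env', 'my-dep', and /tmp/manifest.yml mirror the step 2 command
# in this issue; everything else is an assumption.

# Step 1: add the new CA alongside the old one.
bosh deploy -e my-env -d my-dep /tmp/manifest.yml \
  -o step1-add-new-CA-ops.yml --vars-store creds.yml

# Step 2: switch leaf (ID) certificates over to the new CA.
# This is the step that fails with the CDROM error.
bosh deploy -e my-env -d my-dep /tmp/manifest.yml \
  -o step2-utilize-new-ID-certificate-from-new-CA-ops.yml --vars-store creds.yml

# Step 3: remove the old CA and redeploy (ops file not shown in this issue).
```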
Expected behavior I expect the 3-step process to rotate CFCR certificates to complete without error.
Logs
Step 1 Logs
Step 2 Logs
Versions (please complete the following information):
Deployment info: If possible, share your (redacted) manifest and any ops files used to deploy BOSH or any other releases on top of BOSH.
If you used any deployment strategy it'd be helpful to point it out and share as much about it as possible (e.g. bosh-deployment, PCF, genesis, spiff, etc)
Additional context To state again, my goal is to rotate ALL the CFCR certificates.
This issue and its details only describe kubernetes-dashboard-ca because it has only one CA and one leaf ID certificate, for simplicity.