cloudfoundry-attic / bosh-init

bosh-init is a tool used to create and update the Director VM
Apache License 2.0

Problem with ssl_tunnel when deploying BOSH to OpenStack using bosh-init #68

Closed yagweb closed 8 years ago

yagweb commented 8 years ago

I have an OpenStack cloud installed on VirtualBox virtual machines, and I tried to deploy BOSH to it with bosh-init, following the manual. In the resource_pools section of the deployment manifest, I set instance_type to m1.large, as shown below:

resource_pools:
- name: vms
  network: private
  stemcell:
    url: https://bosh.io/d/stemcells/bosh-openstack-kvm-ubuntu-trusty-go_agent?v=3197
    sha1: 6bbba614066d77b1fc6d758d33c7e3cc5d2b682a
  cloud_properties:
    instance_type: m1.large

Everything is OK until bosh-init waits for the agent on the VM to be ready. At that point bosh-init fails with the error message shown below:

Command 'deploy' failed:
Deploying:
    Creating instance 'bosh/0':
      Waiting until instance is ready:
        Starting SSH tunnel:
          Failed to connect to remote server:
            dial tcp 172.16.0.139:22:getsocketopt: connection refused

But after several minutes the VM is ready, and I can ssh into it.

At first I thought the waiting step was timing out, because I tried several times and found that the VM was at a different stage of initialization each time the problem occurred. Working from this assumption, I found this post: https://github.com/cloudfoundry/bosh-init/issues/67. Following cppforlife's comments there, I added the custom options below to the deployment manifest:

cloud_provider:
  ssh_tunnel:
    connection_refused_timeout: x
    auth_failure_timeout: x

I don't know whether x is meant to be the literal letter x or should be replaced by a number, so I tried both the letter x and the number 36000. The problem persisted after these modifications. In these attempts, however, I noticed that the error appears immediately when the VM restarts its OpenSSH server, as shown below:

...
Creating SSH2 RSA key; this may take some time ... 
Creating SSH2 DSA key; this may take some time ... 
Creating SSH2 ECDSA key; this may take some time ... 
Creating SSH2 ED25519 key; this may take some time ... 
* Stopping OpenSSH server                                            [ OK ]  
* Starting OpenSSH server                                            [ OK ]
...

That is, as soon as the VM stops the OpenSSH server, bosh-init returns the error message immediately, even though the VM starts the OpenSSH server again right afterwards. I also tried several versions of the Ubuntu stemcells, and the same problem remained. (I tried CentOS stemcells too, but got a different error.)

After trying different parameter combinations, I found a workaround: changing the instance_type from m1.large to m1.small sometimes lets the deployment continue. But the deployment then takes extremely long, about 11 hours, and some jobs may fail to run, as below:

...
Compiling package 'director/5a8e3bd5b495695932226d0e5da57e7ef854b7de'... Finished (01:11:31)
  Updating instance 'bosh/0'... Finished (00:07:15)
  Waiting for instance 'bosh/0' to be running... Failed (00:19:25)
Failed deploying (11:29:35)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)

Command 'deploy' failed:
  Deploying:
    Received non-running job state: 'failing'

Setting instance_type to m1.medium did not work either. It seems that when VCPUs and memory are limited, bosh-init sometimes misses the moment when the OpenSSH server is stopped.

My question is: did I miss some ssh_tunnel options, or is this a bug in the sshtunnel that prevents using an instance with the m1.large flavor? Or can bosh-init continue deploying BOSH onto an existing VM?

ajay-aggarwal commented 8 years ago

I am running into a similar problem. My bosh-init gets stuck at exactly the same point, "Waiting for the agent on VM xxx to be ready...". After enabling bosh-init logging, I see this in the log:

[sshTunnel] 2016/03/03 21:11:12 DEBUG - Dialing remote server at 10.152.5.48:22
[sshTunnel] 2016/03/03 21:11:12 DEBUG - Making attempt #0
[sshTunnel] 2016/03/03 21:26:59 DEBUG - Attempt failed #0: Dialing remote server: ssh: handshake failed: EOF
[sshTunnel] 2016/03/03 21:27:00 DEBUG - Making attempt #1
...

I could verify that SSH port 22 is open and accessible. Not sure what the reason is.

You mentioned you were able to ssh into the VM. Did you use an SSH key or a username/password to ssh into the VM? Which key or username/password did you use?
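
The log above shows a port that accepts TCP connections while the SSH handshake still ends in EOF, so "port 22 is open" and "sshd is answering" are worth checking separately. The following is a minimal sketch of such a check, not anything bosh-init does itself; the function name `probe_ssh` and its three result labels are made up for illustration.

```python
import socket


def probe_ssh(host, port=22, timeout=5.0):
    """Classify an endpoint as 'unreachable', 'no-banner', or 'ssh'.

    Hypothetical helper for illustration; not part of bosh-init.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, host unreachable, or connect timeout.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except OSError:
            # The port accepted the connection but sent nothing in time.
            return "no-banner"
    # An SSH server normally opens with an identification string that
    # starts with "SSH-". An empty read means the peer closed right away.
    return "ssh" if data.startswith(b"SSH-") else "no-banner"


if __name__ == "__main__":
    # 10.152.5.48 is the address from the log above; replace as needed.
    print(probe_ssh("10.152.5.48", 22))
```

A result of "no-banner" would be consistent with the "handshake failed: EOF" lines: something accepts the connection, but no SSH server replies behind it.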

yagweb commented 8 years ago

@ajay-aggarwal I used the SSH key that was created under "Key Pairs" in the OpenStack project and specified for the instance through the "default_key_name" and "private_key" fields in the deployment manifest. It is the same key that bosh-init uses. In my case, SSH port 22 on the instance is opened, then closed, and then opened again during initialization. It looks like bosh-init aborted when port 22 closed. I can ssh into the VM after the port opens again.
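
The open, closed, open-again behaviour described here is the case a "connection refused" timeout is meant to absorb: a refused connection is retried until a deadline instead of being treated as fatal. The sketch below illustrates that idea only. It is not bosh-init's implementation, and `wait_for_port`, `refused_timeout`, and `retry_delay` are hypothetical names with values in seconds.

```python
import socket
import time


def wait_for_port(host, port, refused_timeout=60.0, retry_delay=1.0):
    """Retry a TCP connect until it succeeds or refused_timeout seconds pass.

    Returns True once a connection is accepted, False if the deadline expires.
    Hypothetical helper for illustration; not part of bosh-init.
    """
    deadline = time.monotonic() + refused_timeout
    while True:
        try:
            # A closed port raises ConnectionRefusedError, an OSError subclass.
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            # sshd may be stopped or restarting; retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(retry_delay)


if __name__ == "__main__":
    # 172.16.0.139 is the VM address from the error message earlier in the thread.
    print(wait_for_port("172.16.0.139", 22, refused_timeout=600.0))
```

Running something like this from the machine that runs bosh-init would show how long port 22 stays refused while the VM regenerates host keys and restarts sshd.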

cppforlife commented 8 years ago

@yagweb looks like your env is running virtualization inside virtualization, and that's resulting in extremely slow execution times. not much we can do about that except suggest switching to a faster env. i doubt you'll be able to deploy anything to your existing env if deploying bosh itself is that slow.

@ajay-aggarwal i would recommend opening a separate issue at https://github.com/cloudfoundry-incubator/bosh-openstack-cpi-release. seems to be just a misconfiguration in your openstack env.

ganeshkaila commented 7 years ago

@cppforlife I ran into a situation similar to @yagweb's.

===== 2016-12-22 13:36:38 UTC Running "bosh-init deploy /var/tempest/workspaces/default/deployments/bosh.yml"
Deployment manifest: '/var/tempest/workspaces/default/deployments/bosh.yml'
Deployment state: '/var/tempest/workspaces/default/deployments/bosh-state.json'

Started validating
  Validating release 'bosh'... Finished (00:01:23)
  Validating release 'bosh-openstack-cpi'... Finished (00:00:04)
  Validating release 'uaa'... Finished (00:00:49)
  Validating cpi release... Finished (00:00:00)
  Validating deployment manifest... Finished (00:00:00)
  Validating stemcell... Finished (00:00:50)
Finished validating (00:03:08)

Started installing CPI
  Compiling package 'ruby_openstack_cpi/9485b5753d4609e92e1491ff991cb28fbde81445'... Finished (01:36:57)
  Compiling package 'bosh_openstack_cpi/dd0bab98dbb820af3ec59b364badfed02ffe3f3b'... Finished (00:00:41)
  Installing packages... Finished (00:00:07)
  Rendering job templates... Finished (00:00:39)
  Installing job 'openstack_cpi'... Finished (00:00:00)
Finished installing CPI (01:38:24)

Starting registry... Finished (00:00:00)
Uploading stemcell 'bosh-openstack-kvm-ubuntu-trusty-go_agent-raw/3312.9'... Finished (00:05:19)

Started deploying
  Creating VM for instance 'bosh/0' from stemcell '7b886ad0-e67c-48e9-8f26-52762210acd9'... Finished (00:01:17)
  Waiting for the agent on VM '3275a1eb-9d32-44e2-936a-fd74652919a4' to be ready... Failed (00:00:00)
Failed deploying (00:01:17)

Stopping registry... Finished (00:00:00)
Cleaning up rendered CPI jobs... Finished (00:00:00)

Command 'deploy' failed:
  Deploying:
    Creating instance 'bosh/0':
      Waiting until instance is ready:
        Starting SSH tunnel:
          Parsing private key file '/tmp/bosh_ec2_private_key.pem':
            asn1: structure error: superfluous leading zeros in length
===== 2016-12-22 15:25:09 UTC Finished "bosh-init deploy /var/tempest/workspaces/default/deployments/bosh.yml"; Duration: 6510s; Exit Status: 1
Exited with 1.

FYI, I use OpenStack Mitaka. When I run cf-openstack-validator, the validator fails to assign a floating IP to the VM. But I am able to assign a floating IP manually, which means that the OpenStack environment is good enough to deploy Cloud Foundry.

@yagweb If you find any workaround for this situation, could you help me with it? Thanks.

ganeshkaila commented 7 years ago

@yagweb The solution is described here. I solved this problem by importing a manually created SSH key into OpenStack.
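
The "superfluous leading zeros in length" error in the log above comes from parsing the private key file, before any connection is made, so the key file can be checked on its own. The sketch below assumes the message corresponds to the DER rule that lengths must be minimally encoded, and it inspects only the outermost header of a PEM-wrapped key. The function name `check_pem_der_header` is hypothetical, and nested fields are not examined.

```python
import base64


def check_pem_der_header(pem_text):
    """Return 'ok' or a short description of the first problem found.

    Hypothetical helper: only the outermost DER tag and length are checked.
    """
    lines = [ln.strip() for ln in pem_text.strip().splitlines()]
    if (len(lines) < 3 or not lines[0].startswith("-----BEGIN ")
            or not lines[-1].startswith("-----END ")):
        return "missing PEM BEGIN/END lines"
    # Skip optional "Name: value" headers found in encrypted legacy PEM files.
    body = "".join(ln for ln in lines[1:-1] if ":" not in ln)
    try:
        der = base64.b64decode(body, validate=True)
    except ValueError:
        return "body is not valid base64"
    # Traditional RSA and PKCS#8 keys are a DER SEQUENCE (tag 0x30).
    # Keys in the newer OpenSSH format are not DER and are reported here.
    if len(der) < 2 or der[0] != 0x30:
        return "does not start with a DER SEQUENCE"
    first = der[1]
    if first < 0x80:
        return "ok"  # short-form length, always minimal
    count = first & 0x7F
    length_bytes = der[2:2 + count]
    if count == 0 or len(length_bytes) < count:
        return "truncated or indefinite length"
    if length_bytes[0] == 0:
        return "superfluous leading zeros in length"
    if count == 1 and length_bytes[0] < 0x80:
        return "non-minimal length"
    return "ok"


if __name__ == "__main__":
    # Path taken from the error message in the log above.
    with open("/tmp/bosh_ec2_private_key.pem") as f:
        print(check_pem_der_header(f.read()))
```

If the header check passes, the malformed length may sit deeper in the structure. Either way, generating a fresh key pair locally and importing it into OpenStack, as described in the comment above, sidesteps the problem.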