cloudfoundry / bosh-bootloader

Command line utility for standing up a BOSH director on an IAAS of your choice.
Apache License 2.0

AWS: bbl up fails with error while setting up BOSH director #589

Closed harshcommits closed 11 months ago

harshcommits commented 1 year ago

I have been attempting an installation of CF by setting up a BOSH director on AWS, referring to the following documentation: https://docs.cloudfoundry.org/deploying/common/aws.html

The installation keeps failing while deploying the BOSH director, with the error below:

Creating VM for instance 'bosh/0' from stemcell 'ami-02d4bec6e9057caaa'... Finished (00:00:33)
  Waiting for the agent on VM 'i-0020773ae28921319' to be ready... Finished (00:00:17)
  Attaching disk 'vol-07837629b3efda103' to VM 'i-0020773ae28921319'... Finished (00:00:21)
  Rendering job templates... Finished (00:00:12)
  Compiling package 'golang-1-linux/2598f3c61c5a9e674fb01f9d5c11ecdbaeb18622b0296ade3fed5e774ee15421'... Skipped [Package already compiled] (00:00:12)
  Compiling package 'tini/3d7b02f3eeb480b9581bec4a0096dab9ebdfa4bc'... Skipped [Package already compiled] (00:00:00)
  Compiling package 'bpm-runc/9f66395d85ace4b4d4908069742f7db27dc28d0a'... Skipped [Package already compiled] (00:00:02)
  Compiling package 'aws-cpi-ruby-3.1/8b225e7cc2608305a7b784b5828b2b4b7c7adc3eb14af46e313d64a9e14a3ad6'... Failed (00:12:17)
Failed deploying (00:15:00)

Cleaning up rendered CPI jobs... Finished (00:00:00)

Deploying:
  Building state for instance 'bosh/0':
    Compiling job package dependencies for instance 'bosh/0':
      Compiling job package dependencies:
        Remotely compiling package 'aws-cpi-ruby-3.1' with the agent:
          Sending 'compile_package' to the agent:
            Sending 'get_task' to the agent:
              Performing request to agent:
                Performing POST request:
                  Post "https://mbus:<redacted>@10.0.0.6:6868/agent": EOF

Exit code 1

Would really appreciate some support here.

jpalermo commented 1 year ago

I feel like I've seen the EOF error when there is a second VM using the same IP address. I'm not sure that would be possible in AWS using bbl like this, but it might be a good idea to double check there isn't a second VM.
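If you want to rule that out quickly, an AWS CLI query along these lines should list every running instance holding a given private IP. This is just a sketch: the function name is mine, and it assumes the AWS CLI is installed and configured for the region bbl deployed into.

```shell
# Sketch: list any running instance bound to the director's private IP.
# Assumes the AWS CLI is configured for the region bbl deployed into.
find_ip_claimants() {
  director_ip="${1:-10.0.0.6}"   # internal IP from the error message
  aws ec2 describe-instances \
    --filters "Name=private-ip-address,Values=${director_ip}" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].[InstanceId,PrivateIpAddress]' \
    --output text
}
# find_ip_claimants 10.0.0.6   # more than one line of output means a conflict
```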

If that's not the problem, you'll probably need to get the agent logs from the bosh VM. Since the VM didn't fully configure itself, you'll probably need to use the instructions here to gain access to the logs.
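I don't remember the exact steps off-hand, but the general shape is a SOCKS tunnel through the bbl jumpbox. A rough sketch below; the `jumpbox` user, local port 5000, `vcap` login, and the agent log path `/var/vcap/bosh/log/current` are the conventional bbl/BOSH defaults, so verify all of them against the linked instructions (director credentials in particular come from those docs, not from this sketch).

```shell
# Sketch: open a SOCKS5 proxy through the bbl jumpbox, then read the agent
# log on the half-created director VM. Run from the bbl state directory.
tail_agent_log() {
  key="$(mktemp)"
  bbl ssh-key > "$key" && chmod 600 "$key"            # jumpbox private key
  ssh -4 -fNC -D 5000 -i "$key" jumpbox@"$(bbl jumpbox-address)"
  # The -x flag below needs BSD netcat; director login details (vcap user,
  # key or password) come from the linked instructions, not from bbl itself.
  ssh -o ProxyCommand='nc -x localhost:5000 %h %p' vcap@10.0.0.6 \
      sudo tail -n 200 /var/vcap/bosh/log/current
}
# tail_agent_log   # run from the directory containing bbl-state.json
```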

harshcommits commented 1 year ago

I didn't find a second VM using the same IP address. However, the EOF error usually comes up when I attempt a reinstall after a fresh installation gets stuck at this error:

Started deploying
  Creating VM for instance 'bosh/0' from stemcell 'ami-0ff08728f60308f36'... Finished (00:00:33)
  Waiting for the agent on VM 'i-0ea93202dcf1faf68' to be ready... Failed (00:10:09)
Failed deploying (00:10:42)

Cleaning up rendered CPI jobs... Finished (00:00:00)

Deploying:
  Creating instance 'bosh/0':
    Waiting until instance is ready:
      Post "https://mbus:<redacted>@10.0.0.6:6868/agent": Creating SOCKS5 dialer: get host key: dial tcp 44.228.214.175:22: i/o timeout

Exit code 1

On the second attempt, this step executes successfully, only for the deploy to fail with the aforementioned EOF error. I am not sure if that might be causing the issue, but it is one possibility I could think of.
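For what it's worth, a quick way to check whether that `dial tcp ...:22: i/o timeout` is a plain reachability problem is to probe the jumpbox's SSH port from the machine running bbl. A sketch, with a function name of my own; 44.228.214.175 is the public IP from the timeout above.

```shell
# Sketch: verify TCP reachability of the jumpbox's SSH port from this machine.
check_jumpbox_ssh() {
  jumpbox_ip="${1:?usage: check_jumpbox_ssh <ip>}"
  if nc -z -w 5 "$jumpbox_ip" 22; then
    echo "port 22 reachable"
  else
    echo "port 22 blocked or host down; check security groups / NACLs" >&2
    return 1
  fi
}
# check_jumpbox_ssh 44.228.214.175   # public IP from the timeout above
```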

While going through the logs following the steps in the link you shared, this error in particular kept repeating throughout:

2023-10-06_02:26:57.19866 [File System] 2023/10/06 02:26:57 DEBUG - Checking if file exists /var/vcap/bosh/spec.json
2023-10-06_02:26:57.19866 [File System] 2023/10/06 02:26:57 DEBUG - Stat '/var/vcap/bosh/spec.json'
2023-10-06_02:26:57.19867 [File System] 2023/10/06 02:26:57 DEBUG - Reading file /var/vcap/bosh/spec.json
2023-10-06_02:26:57.19867 [File System] 2023/10/06 02:26:57 DEBUG - Read content
2023-10-06_02:26:57.19867 ********************
2023-10-06_02:26:57.19868 {"properties":{"logging":{"max_log_file_size":""}},"job":{"name":"bosh","release":"","template":"","version":"","templates":[]},"packages":{},"configuration_hash":"unused-configuration-hash","networks":{"default":{"cloud_properties":{"subnet":"subnet-025f0ad00b3db32aa"},"default":["dns","gateway"],"dns":["8.8.8.8"],"gateway":"10.0.0.1","ip":"10.0.0.6","netmask":"255.255.255.0","type":"manual"}},"resource_pool":null,"deployment":"bosh","name":"bosh","index":0,"id":"0","az":"unknown","persistent_disk":0,"rendered_templates_archive":{"sha1":null,"blobstore_id":""}}
2023-10-06_02:26:57.19868 ********************
2023-10-06_02:26:57.19869 [File System] 2023/10/06 02:26:57 DEBUG - Writing /var/vcap/instance/health.json
2023-10-06_02:26:57.19869 [File System] 2023/10/06 02:26:57 DEBUG - Making dir /var/vcap/instance with perm 0777
2023-10-06_02:26:57.19869 [File System] 2023/10/06 02:26:57 DEBUG - Write content
2023-10-06_02:26:57.19869 ********************
2023-10-06_02:26:57.19870 {"state":"running"}
2023-10-06_02:26:57.19870 ********************
2023-10-06_02:26:57.19871 [attemptRetryStrategy] 2023/10/06 02:26:57 DEBUG - Making attempt #0 for *retrystrategy.retryable
2023-10-06_02:26:57.19871 [agent] 2023/10/06 02:26:57 INFO - Attempting to send Heartbeat
2023-10-06_02:27:27.19722 [monitJobSupervisor] 2023/10/06 02:27:27 DEBUG - Getting monit status
2023-10-06_02:27:27.19740 [http-client] 2023/10/06 02:27:27 DEBUG - status function called
2023-10-06_02:27:27.19741 [http-client] 2023/10/06 02:27:27 DEBUG - Monit request: url='http://127.0.0.1:2822/_status2?format=xml' body=''
2023-10-06_02:27:27.19741 [attemptRetryStrategy] 2023/10/06 02:27:27 DEBUG - Making attempt #0 for *httpclient.RequestRetryable
2023-10-06_02:27:27.19742 [clientRetryable] 2023/10/06 02:27:27 DEBUG - [requestID=2ab2dacd-745c-4894-4ef3-d67db6f324b4] Requesting (attempt=1): Request{ Method: 'GET', URL: 'http://127.0.0.1:2822/_status2?format=xml' }
2023-10-06_02:27:27.19812 [http-client] 2023/10/06 02:27:27 DEBUG - Unmarshalled Monit status: {XMLName:{Space: Local:service} Name:system_3f6ded23-4759-4b5f-5060-578fedd91faf Pending:0 Status:0 StatusMessage: Monitor:1 Uptime:0 Children:0 Memory:{XMLName:{Space: Local:} Percent:0 PercentTotal:0 Kilobyte:0 KilobyteTotal:0} CPU:{XMLName:{Space: Local:} Percent:0 PercentTotal:0}}
2023-10-06_02:27:27.19814 [File System] 2023/10/06 02:27:27 DEBUG - Checking if file exists /var/vcap/monit/stopped
2023-10-06_02:27:27.19814 [File System] 2023/10/06 02:27:27 DEBUG - Stat '/var/vcap/monit/stopped'
2023-10-06_02:27:27.19814 [agent] 2023/10/06 02:27:27 DEBUG - Building heartbeat
2023-10-06_02:27:27.19825 [File System] 2023/10/06 02:27:27 DEBUG - Reading file /proc/mounts
2023-10-06_02:27:27.19846 [File System] 2023/10/06 02:27:27 DEBUG - Read content
2023-10-06_02:27:27.19846 ********************

While trying to read the contents of /var/vcap/bosh/spec.json, the output seems to be missing the resource pool, AZ, and other details, and the agent eventually times out with an error. There wasn't anything in the logs to indicate why that was happening, though.

jpalermo commented 11 months ago

It seems like there may just be a communication problem between the machine where you are running the command and the jumpbox that traffic to the director goes through.

Are there any firewalls that might be causing problems?

The agent logs you pasted are totally normal; that's just the heartbeat that runs every minute or so, checking the status of the VM. If there are any errors in there, they might point to the problem. If there are none, it almost certainly means there is a communication problem of some kind and the commands to control the agent never even make it there.
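One way to test that path directly is to probe the agent's mbus endpoint (the 6868 port from your error) through a SOCKS proxy running over the jumpbox. A sketch, assuming you already have such a proxy listening locally, e.g. an `ssh -D 5000` tunnel through the jumpbox; the function name and port are mine.

```shell
# Sketch: probe the agent's mbus endpoint through an already-open SOCKS5
# proxy (e.g. an `ssh -D 5000` tunnel through the bbl jumpbox).
probe_mbus() {
  proxy="${1:-socks5h://localhost:5000}"
  # Any HTTP/TLS-level response (even an auth error) means the network path
  # is open; an immediate EOF or timeout points at a firewall or dead agent.
  curl -vk --connect-timeout 10 -x "$proxy" https://10.0.0.6:6868/agent
}
# probe_mbus
```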

harshcommits commented 11 months ago

I had the same suspicion, since the commands were being run from a machine on the internal network. I tried running it from an EC2 instance in the same region, with the same result.

Regardless, the i/o timeout goes away on a second attempt, only to fail with the original EOF error at get_task. I will take another look at the network config, since that seems to be the most plausible explanation.

jpalermo commented 11 months ago

Running with the environment variable BOSH_LOG_LEVEL=debug will get you a lot more output from the CLI. I don't expect it to be super helpful in this type of situation, but it might be worth taking a look.
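For example (a sketch; the `command -v` guard just keeps the snippet harmless when bbl isn't on the PATH):

```shell
# bbl passes its environment to the bosh CLI it invokes, which is where
# most of the debug output comes from.
export BOSH_LOG_LEVEL=debug
if command -v bbl >/dev/null 2>&1; then
  bbl up   # re-run the failing step; output now includes each agent request
fi
```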

harshcommits commented 11 months ago

After a few attempts at installing from scratch and cleaning up the config folders, I was able to complete the installation. I am still not sure what exactly caused the issue, but I will be closing this for now.

Thanks for all the help @jpalermo.