cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0
2.03k stars 657 forks source link

Add properties to set agent timeout #2401

Closed ogrand closed 1 year ago

ogrand commented 1 year ago

Is your feature request related to a problem? Please describe. When I use nats bosh director 3 steps rotation for some deployments, agent restart takes too long time with arping on public ips, and at the end the deployment failed.

=> Agent restart
2022-07-06_15:43:17.97473 [main] 2022/07/06 15:43:17 DEBUG - Starting agent
2022-07-06_15:43:18.00893 [File System] 2022/07/06 15:43:18 DEBUG - Checking if file exists /var/vcap/bosh/update_settings.json
2022-07-06_15:43:18.00895 [File System] 2022/07/06 15:43:18 DEBUG - Stat '/var/vcap/bosh/update_settings.json'
2022-07-06_15:43:18.00896 [File System] 2022/07/06 15:43:18 DEBUG - Reading file /var/vcap/bosh/update_settings.json
2022-07-06_15:43:18.07189 [linuxPlatform] 2022/07/06 15:43:18 DEBUG - Getting remaining size of `/dev/vda'
2022-07-06_15:43:18.07189 [RootDevicePartitioner] 2022/07/06 15:43:18 DEBUG - Getting size of disk remaining after first partition

=> First request from bosh director (2mn15 before error)
D, [2022-07-06T15:43:26.790172 #24901] [canary_update(osb-relay/e223214f-e91f-4c7f-b14a-9aebcfc46295 (0))] DEBUG -- DirectorJobRunner: SENT: agent.827809e1-7482-45d7-938a-0652f13a74aa {"protocol":3,"method":"get_task","arguments":["3fb7641f-95ca-41ff-76fd-ec8951bf6bcd"],"reply_to":"director.b315c26e-a011-4d30-8bf1-ac2b84db5a08.827809e1-7482-45d7-938a-0652f13a74aa.986169d6-bab8-4326-b6e0-88ee57e36850"}

=> Arping for 1mn30
2022-07-06_15:43:18.22706 [Cmd Runner] 2022/07/06 15:43:18 DEBUG - Running command 'sync-time'
2022-07-06_15:43:23.07507 [Cmd Runner] 2022/07/06 15:43:23 DEBUG - Running command 'arping -c 1 -U -I eth0 10.228.194.11'
2022-07-06_15:43:23.07641 [Cmd Runner] 2022/07/06 15:43:23 DEBUG - Stdout: ARPING 10.228.194.11 from 10.228.194.11 eth0
2022-07-06_15:43:23.07641 Sent 1 probes (1 broadcast(s))
2022-07-06_15:43:23.07642 Received 0 response(s)
...
2022-07-06_15:44:48.10878 [Cmd Runner] 2022/07/06 15:44:48 DEBUG - Running command 'arping -c 1 -U -I eth0 10.228.194.11'
2022-07-06_15:44:48.10973 [Cmd Runner] 2022/07/06 15:44:48 DEBUG - Stdout: ARPING 10.228.194.11 from 10.228.194.11 eth0
2022-07-06_15:44:48.11042 Sent 1 probes (1 broadcast(s))
2022-07-06_15:44:48.11043 Received 0 response(s)

=> Connect to nats
2022-07-06_15:44:48.81492 [NATS Handler] 2022/07/06 15:44:48 INFO - Attempting to connect to NATS
2022-07-06_15:44:48.81545 [NATS Handler] 2022/07/06 15:44:48 INFO - Sending hm message 'heartbeat'
2022-07-06_15:44:48.81633 [NATS Handler] 2022/07/06 15:44:48 DEBUG - Cleaned ip-mac address cache for: 192.168.116.158.
2022-07-06_15:44:48.82997 [NATS Handler] 2022/07/06 15:44:48 INFO - Subscribing to agent.827809e1-7482-45d7-938a-0652f13a74aa
2022-07-06_15:45:18.81928 [NATS Handler] 2022/07/06 15:45:18 INFO - Sending hm message 'heartbeat'

=> Timeout after 45s, then bosh director error
E, [2022-07-06T15:45:41.797723 #24901] [canary_update(osb-relay/e223214f-e91f-4c7f-b14a-9aebcfc46295 (0))] ERROR -- DirectorJobRunner: Error updating canary instance: #<Bosh::Director::RpcTimeout: Timed out sending 'get_task' to instance: 'osb-relay/e223214f-e91f-4c7f-b14a-9aebcfc46295', agent-id: '827809e1-7482-45d7-938a-0652f13a74aa' after 45 seconds>

To workaround this problem, I patch agent_client.rb ruby script to change agent start timeout and retries values:

FILE="/var/vcap/data/packages/director/*/gem_home/ruby/*/gems/bosh-director-*/lib/bosh/director/agent_client.rb"
RUBY_PATH="$(dirname ${FILE})"
RUBY_SCRIPT="$(basename ${FILE})"
RUBY_SCRIPT="${RUBY_PATH}/${RUBY_SCRIPT}"
sed -i "s+GET_TASK_MAX_RETRIES =.*+GET_TASK_MAX_RETRIES = 3+g" ${RUBY_SCRIPT}
sed -i "s+@timeout = options.*+@timeout = options[:timeout] || 60+g" ${RUBY_SCRIPT}

Describe the solution you'd like It would be nice if we can use 2 new bosh properties to customize agent timeout instead of dirty patch on bosh release:

Thanks

mrosecrance commented 1 year ago

Which director version are you running on?

poblin-orange commented 1 year ago

We are running bosh/273.0.0, with bosh-vsphere-esxi-ubuntu-bionic-go_agent/1.97

ae-govau commented 1 year ago

Since bumping to the bosh-deployment that uses NATS 2.0 (first with bionic) and now with Jammy, we are regularly seeing errors during compilation with:

Error: Timed out sending 'compile_package' to instance: 'compilation-xxx', agent-id: 'yyy' after 45 seconds

If we re-try a few times, eventually it seems to work.

I'm thinking it genuinely is a timeout, and would really like to be able to try extending it past the 45 seconds.

Would love to see some properties for this, particularly for the agent used on the compilation VMs.

(in case it's related, the VMs we are dealing with don't have outbound internet connectivity)

ae-govau commented 1 year ago

We're going to try this pre-start script on our director deploys and see if it helps with the next round of stemcell updates...

- type: replace
  path: /instance_groups/name=bosh/jobs/-
  value:
    name: pre-start-script
    release: os-conf
    properties:
      script: |
        #!/usr/bin/env bash
        set -euo pipefail

        # Bump agent timeout
        sed -i -e "s/\@timeout = options\[\:timeout\] || 45/\@timeout = options\[\:timeout\] || 600/g" /var/vcap/packages/director/gem_home/ruby/3.1.0/gems/bosh-director-0.0.0/lib/bosh/director/agent_client.rb
rkoster commented 1 year ago

This PR has just been merged: https://github.com/cloudfoundry/bosh/pull/2406