cloudfoundry / capi-release

Bosh Release for Cloud Controller and friends
Apache License 2.0
24 stars 101 forks source link

cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers #267

Open ohkyle opened 2 years ago

ohkyle commented 2 years ago

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

cloud_controller_worker pre-backup-lock hangs when there are 10 or more cloud_controller_workers

Context

We ran into issues when trying use bbr to backup our cf deployment.

We deployed cf with this ops file:

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
  value: 10

- type: replace
  path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/jobs?/generic?/number_of_workers?
  value: 11

- type: replace
  path: /instance_groups/name=cc-worker/vm_type
  value: medium

We observed these logs

...
[bbr] 2022/09/13 15:31:55 INFO - Finished locking cloud_controller_clock on scheduler/b468a5cd-130a-442a-bcab-bfbc40cb4ab5 for backup.
[bbr] 2022/09/13 15:32:03 INFO - Finished locking cloud_controller_ng on api/dc671209-452f-43c9-8077-3927b614ffad for backup.
[bbr] 2022/09/13 15:32:06 INFO - Finished locking cloud_controller_ng on api/5859cb84-76ed-492f-b03f-023532aa352e for backup.

On the vm we see

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary
The Monit daemon 5.2.5 uptime: 2d 3h 22m

Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_2' running
Process 'cloud_controller_worker_3' running
Process 'cloud_controller_worker_4' running
Process 'cloud_controller_worker_5' running
Process 'cloud_controller_worker_6' running
Process 'cloud_controller_worker_7' running
Process 'cloud_controller_worker_8' running
Process 'cloud_controller_worker_9' running
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
Process 'loggregator_agent'         running
Process 'loggr-forwarder-agent'     running
Process 'loggr-syslog-agent'        running
Process 'prom_scraper'              running
Process 'metrics-discovery-registrar' running
Process 'metrics-agent'             running
Process 'bosh-dns'                  running
Process 'bosh-dns-resolvconf'       running
Process 'bosh-dns-healthcheck'      running

On the vm we also see

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# /var/vcap/jobs/cloud_controller_worker/bin/bbr/pre-backup-lock
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...
Waiting for cloud_controller_worker_1 to be unmonitored...

Steps to Reproduce

  1. deploy cf with
    
    - type: replace
    path: /instance_groups/name=cc-worker/jobs/name=cloud_controller_worker/properties/cc/broker_client_default_async_poll_interval_seconds?
    value: 10

Expected result

We expected the command in step 2 to succeed.

Current result

Currently the command is failing.

Possible Fix

This piece of code

function wait_unmonitor_job() {
  local job_name="$1"

  while true; do
    if [[ $(sudo /var/vcap/bosh/bin/monit summary | grep ${job_name} ) =~ not[[:space:]]monitored[[:space:]]*$ ]]; then
      echo "Unmonitored ${job_name}"
      return 0
    else
      echo "Waiting for ${job_name} to be unmonitored..."
    fi

    sleep 0.1
  done
}

seems to assume that there will always be less than 10 cloud_controller_workers.

On our vm with 11 workers

cc-worker/ff736dfe-8c81-4528-98a6-5698ba183123:~# monit summary | grep cloud_controller_worker_1
Process 'cloud_controller_worker_1' not monitored
Process 'cloud_controller_worker_10' running
Process 'cloud_controller_worker_11' running
ohkyle commented 2 years ago

A potential fix for this bug

.*${job_name}.*[[:space:]]not[[:space:]]monitored[[:space:]]

I have attached a patch as well (since I do not have push rights to the repo) 0001-monit_utils-Add-support-for-more-than-10-cc-workers.txt