cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0

Make BBS more resilient to API port being unavailable #812

Closed rroberts2222 closed 7 months ago

rroberts2222 commented 1 year ago

Summary

When another process has claimed BBS's listen_addr, a deployment can result in no BBS instances being available. The deployment does not fail fast as we would expect, because BBS does not attempt to listen on the port until it becomes the active node.

Steps to Reproduce

Deploy otel-collector job using this operations file and this metric exporter config:

prometheus:
  endpoint: 127.0.0.1:8889
  namespace: default

See the deploy fail only after all of the BBS instances have rolled, at which point BBS is completely unavailable.

Diego repo

bbs

Environment Details

diego-release 2.81.0 and loggregator-agent-release 7.6.0

Possible Causes or Fixes (optional)

Causes: The BBS node only listens on the API port when it has claimed the lock to become the active BBS node. This means that the job can roll without listening on the listen_addr and won't know another process is using it until it tries to become the active node.

Possible Fixes:

@acrmp

geofffranks commented 1 year ago

I have another option to propose. It would use Linux-specific code (BBS only runs on Linux VMs, though).

We make BBS bind to the port on startup (or fail if something else has bound it), keep the port closed, and listen later, via something like this example. It's not pretty, but it does work.

This would guard against concerns from the automatic-listen at startup option, as well as race conditions between BBS + otel processes doing their port checking.
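
A minimal sketch of the bind-on-startup, listen-later idea, for illustration only. It is not the example linked above and not BBS code. The helper names (reservePort, startListening, canConnect) and the loopback address are assumptions, and the demo binds to port 0 so the kernel picks a free port. It uses the Linux socket calls exposed by Go's syscall package.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
	"time"
)

// reservePort (hypothetical helper) creates a TCP socket and binds it to
// 127.0.0.1:port without calling listen(2). Bind returns an error such as
// EADDRINUSE if the port is already taken, so startup can fail fast.
// SO_REUSEADDR is deliberately not set, so that another process cannot
// bind the same address while this socket holds it.
func reservePort(port int) (fd int, boundPort int, err error) {
	fd, err = syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
	if err != nil {
		return -1, 0, err
	}
	addr := &syscall.SockaddrInet4{Port: port, Addr: [4]byte{127, 0, 0, 1}}
	if err = syscall.Bind(fd, addr); err != nil {
		syscall.Close(fd)
		return -1, 0, err
	}
	// Read back the bound address; this matters when port 0 was requested.
	sa, err := syscall.Getsockname(fd)
	if err != nil {
		syscall.Close(fd)
		return -1, 0, err
	}
	in4, ok := sa.(*syscall.SockaddrInet4)
	if !ok {
		syscall.Close(fd)
		return -1, 0, fmt.Errorf("unexpected socket address type %T", sa)
	}
	return fd, in4.Port, nil
}

// startListening (hypothetical helper) calls listen(2) on the reserved
// socket and wraps it in a net.Listener. This is the step that would run
// once the instance becomes the active node.
func startListening(fd int) (net.Listener, error) {
	if err := syscall.Listen(fd, syscall.SOMAXCONN); err != nil {
		syscall.Close(fd)
		return nil, err
	}
	f := os.NewFile(uintptr(fd), "reserved-api-socket")
	defer f.Close() // net.FileListener works on a duplicate of the descriptor
	return net.FileListener(f)
}

// canConnect reports whether a TCP connection to 127.0.0.1:port succeeds.
func canConnect(port int) bool {
	c, err := net.DialTimeout("tcp", fmt.Sprintf("127.0.0.1:%d", port), time.Second)
	if err != nil {
		return false
	}
	c.Close()
	return true
}

func main() {
	fd, port, err := reservePort(0) // 0 asks the kernel for any free port
	if err != nil {
		fmt.Fprintln(os.Stderr, "port unavailable, failing fast:", err)
		os.Exit(1)
	}
	fmt.Println("reachable before listen:", canConnect(port))

	ln, err := startListening(fd)
	if err != nil {
		fmt.Fprintln(os.Stderr, "listen failed:", err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("reachable after listen:", canConnect(port))
}
```

In this sketch the port is held from process start, so a conflicting process would show up as a bind error during the deploy rather than when the lock is acquired. Leaving SO_REUSEADDR unset is a trade-off: it can make a quick restart fail while old connections on the port are still in TIME_WAIT.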

geofffranks commented 1 year ago

@mariash also proposed having BBS listen on startup, but return a 500 (or other failure code) to clients when the instance is not the active node. This seems a lot simpler to implement and will be cross-platform.

winkingturtle-vmw commented 7 months ago

I am going to close this issue since all related PRs have been merged. Please re-open if that's not the case.