concourse / concourse-bosh-deployment

A toolchain for deploying Concourse with BOSH.
Apache License 2.0

Fix `max containers reached` on External Workers on Dynamic Networks #87

Closed cunnie closed 6 years ago

cunnie commented 6 years ago

Fixes `max containers reached` on external workers deployed to BOSH dynamic networks.

Problem description: The worker mistakenly attempts to connect to garden (gdn) over the Ethernet interface, gets ECONNREFUSED, and is therefore unable to reap containers. Instead, the worker should connect to garden over loopback; that's where it's listening.

This PR fixes the following error, from /var/vcap/sys/log/worker/beacon.stdout.log:

{"timestamp":"1534476254.482047796","source":"worker","message":"worker.sweeper.failed-to-report-containers","log_level":2,"data":{"error":"Get http://api/containers: dial tcp 10.2.0.154:7777: connect: connection refused","session":"3"}}
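
To check whether a worker is hitting the same error, the log can be scanned for that message. The following is a minimal sketch, assuming the log is one JSON object per line as in the entry above; the function name `find_sweeper_failures` is made up for illustration and is not part of Concourse.

```python
import json

# Message name taken from the log entry quoted above.
SWEEPER_FAILURE = "worker.sweeper.failed-to-report-containers"


def find_sweeper_failures(log_lines):
    """Yield (timestamp, error) for each sweeper failure found in log_lines.

    Hypothetical helper: assumes one JSON object per line and silently
    skips anything that does not parse.
    """
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("message") == SWEEPER_FAILURE:
            data = entry.get("data") or {}
            yield entry.get("timestamp"), data.get("error")


# The log line from this PR description, used as sample input.
sample = '{"timestamp":"1534476254.482047796","source":"worker","message":"worker.sweeper.failed-to-report-containers","log_level":2,"data":{"error":"Get http://api/containers: dial tcp 10.2.0.154:7777: connect: connection refused","session":"3"}}'

failures = list(find_sweeper_failures([sample]))
for timestamp, error in failures:
    print(timestamp, error)
```

In practice the lines would come from the log file itself, e.g. `find_sweeper_failures(open("/var/vcap/sys/log/worker/beacon.stdout.log"))`. An error mentioning a non-loopback address on port 7777 together with "connection refused" matches the symptom described here.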

Unlike the cluster deployment, where garden binds to 0.0.0.0:7777 (all interfaces), the external worker deployment's garden has been locked down to bind only to 127.0.0.1 (see this commit).
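
The effect of binding only to loopback can be reproduced with plain sockets. This is a rough sketch, not Concourse or garden code: `listen_on` and `can_connect` are hypothetical helpers, and an OS-chosen port stands in for 7777 so the example doesn't collide with a real garden.

```python
import socket


def listen_on(host, port=0):
    """Bind a listening TCP socket to host; port 0 lets the OS pick a free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # includes ConnectionRefusedError and timeouts
        return False


# Stand-in for garden on an external worker: listen on loopback only.
srv = listen_on("127.0.0.1")
port = srv.getsockname()[1]

# Connecting over loopback reaches the listener.
loopback_ok = can_connect("127.0.0.1", port)

# With nothing listening at the target, the connection is refused.
srv.close()
after_close_ok = can_connect("127.0.0.1", port)

print(loopback_ok, after_close_ok)
```

While the loopback-only listener is up, dialing the same port on the machine's non-loopback address (10.2.0.154 in the log above) would be expected to fail the same way as the second call, since nothing is listening on that address. That is the situation the worker was in.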

I believe this problem is caused by a strange interaction between a BOSH dynamic network, BOSH links, and the garden.address property, which defaults to the instance address when the link is not found. I see the following warning when I deploy:

Task 255 | 04:25:35 | Preparing deployment: Preparing deployment
Task 255 | 04:25:36 | Warning: IP address not available for the link provider instance: worker/34c91062-2801-40cf-bd59-a6d92c2f55d1

Also, the description of the worker job property (garden.address) indicates that it defaults to the BOSH link: "If not specified, either the garden link is used, or the instance's address is advertised if the link is not found" (emphasis mine).
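
Read literally, that description implies a three-step fallback. The sketch below is only my reading of the property description, not the worker job's actual template; `resolve_garden_address` and its parameters are hypothetical names.

```python
from typing import Optional


def resolve_garden_address(
    explicit: Optional[str],
    link_address: Optional[str],
    instance_address: str,
) -> str:
    """Pick the garden address following the documented fallback order.

    Hypothetical illustration of the garden.address description:
    1. use the property if it is specified,
    2. otherwise use the garden link if one is found,
    3. otherwise advertise the instance's own address.
    """
    if explicit:
        return explicit
    if link_address:
        return link_address
    return instance_address


# No property and no usable link: falls through to the instance address,
# which is not where a loopback-only garden is listening.
fallback = resolve_garden_address(None, None, "10.2.0.154:7777")

# Setting the property explicitly short-circuits the fallback.
pinned = resolve_garden_address("127.0.0.1:7777", None, "10.2.0.154:7777")

print(fallback, pinned)
```

Under this reading, pinning the address to loopback avoids depending on whether the link resolves on a dynamic network.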

This PR should not have any untoward effects on existing external worker deployments; garden binds to the loopback address, and this PR configures the worker to connect to garden via the loopback address.

[2018-08-17 Updated for clarity]

vito commented 6 years ago

thanks!

vito commented 6 years ago

FWIW I think there's some overlap here: https://github.com/concourse/concourse/issues/2437 - which will be in the next release

cunnie commented 6 years ago

Yeah, it looks like the exact issue I had: no garbage collection because of ECONNREFUSED on port 7777.