pollSpotTermination leaks sockets

Version: 2.0.1 (and 2.0.2)

In https://github.com/buildkite/lifecycled/blob/master/spot.go#L38 we are doing HTTP requests in a loop and deferring calls to res.Body.Close(). Deferred calls are not run until the function returns, which this function does not do until it is interrupted by a signal.

The result is that lsof -i outputs the following after lifecycled has run for a while:

lifecycle 29348   root  776u  IPv4 878016      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36026->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  778u  IPv4 878049      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36038->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  780u  IPv4 878079      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36068->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  781u  IPv4 878650      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36074->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  782u  IPv4 878295      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36084->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  783u  IPv4 878297      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36088->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  784u  IPv4 878660      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36092->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  785u  IPv4 878670      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36114->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  786u  IPv4 878677      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36122->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  787u  IPv4 878730      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36136->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  788u  IPv4 878390      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36140->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  789u  IPv4 878415      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36146->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  790u  IPv4 878789      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36154->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  791u  IPv4 878435      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36176->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348   root  792u  IPv4 878443      0t0  TCP ip-10-11-32-57.eu-west-1.compute.internal:36182->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)

And in following causes:

Sep 04 04:46:18 ip-10-11-32-57.eu-west-1.compute.internal lifecycled[9081]: time="2018-09-04T04:46:18Z" level=info msg="Failed to query metadata service" error="Get http://169.254.169.254/latest/meta-data/spot/termination-time: dial tcp 169.254.169.254:80: socket: too many open files"
Sep 04 04:46:21 ip-10-11-32-57.eu-west-1.compute.internal lifecycled[9081]: time="2018-09-04T04:46:21Z" level=fatal msg="Failed to delete queue: RequestError: send request failed\ncaused by: Post https://sqs.eu-west-1.amazonaws.com/: dial tcp: lookup sqs.eu-west-1.amazonaws.com on 10.11.0.2:53: dial udp 10.11.0.2:53: socket: too many open files"

Solution

Don't use defer inside a loop, and instead close the body explicitly after reading it.
Use the built in ec2metadata service in AWS SDK GO.

I can make a PR for the first one, and then a new PR if we decide to switch to number 2 - let me know what you want.

Interesting, when I look at the logs for an instance that has failed to cleanup it's queue, I see this:

Sep 4 20:55:23 ip-172-31-30-99 lifecycled: time="2018-09-04T20:55:23Z" level=info msg="Failed to query metadata service" error="Get http://169.254.169.254/latest/meta-data/spot/termination-time: dial tcp 169.254.169.254:80: socket: too many open files"

Sep 4 20:57:14 ip-172-31-30-99 lifecycled: time="2018-09-04T20:57:14Z" level=error msg="Failed to delete queue" error="RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: dial udp 10.0.0.2:53: socket: too many open files"

Sep 4 20:57:14 ip-172-31-30-99 lifecycled: lifecycled: error: RequestError: send request failed

Sep 4 20:57:14 ip-172-31-30-99 lifecycled: caused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: no such host, try --help

Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service: main process exited, code=exited, status=1/FAILURE

Sep 4 20:57:14 ip-172-31-30-99 systemd: Unit lifecycled.service entered failed state.

Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service failed.

Perhaps this open file issue is the missing key?

buildkite / lifecycled

pollSpotTermination leaks sockets #34

Solution