Closed itsdalmo closed 6 years ago
Interesting. When I look at the logs for an instance that has failed to clean up its queue, I see this:
Sep 4 20:55:23 ip-172-31-30-99 lifecycled: time="2018-09-04T20:55:23Z" level=info msg="Failed to query metadata service" error="Get http://169.254.169.254/latest/meta-data/spot/termination-time: dial tcp 169.254.169.254:80: socket: too many open files"
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: time="2018-09-04T20:57:14Z" level=error msg="Failed to delete queue" error="RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: dial udp 10.0.0.2:53: socket: too many open files"
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: lifecycled: error: RequestError: send request failed
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: caused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: no such host, try --help
Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service: main process exited, code=exited, status=1/FAILURE
Sep 4 20:57:14 ip-172-31-30-99 systemd: Unit lifecycled.service entered failed state.
Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service failed.
Perhaps this open file issue is the missing key?
I believe so 👍
Version: 2.0.1 (and 2.0.2)
In https://github.com/buildkite/lifecycled/blob/master/spot.go#L38 we are doing HTTP requests in a loop and deferring calls to res.Body.Close(). Deferred calls are not run until the function returns, which this function does not do until it is interrupted by a signal, so every iteration leaves its connection open.
The result is that lsof -i outputs the following after lifecycled has run for a while:
This in turn causes the following:
Solution
1. Don't use defer inside a loop, and instead close the body explicitly after reading it.
2. Use the ec2metadata service in the AWS SDK for Go.
I can make a PR for the first one, and then a new PR if we decide to switch to number 2 - let me know what you want.