Closed: copperlight closed this issue 2 months ago
I cannot reproduce this condition on an Ubuntu Jammy system, using a secondary spectatord
process pointed at an invalid URL so that connect timeouts are triggered.
sudo /usr/local/bin/spectatord --admin_port 1235 --enable_socket false --port 1235 --uri http://example.com:81
[2023-09-01 15:57:25.936] [spectator] [info] Starting to send metrics to http://example.com:81 every 5s.
[2023-09-01 15:57:25.936] [spectatord] [info] Starting admin server on port 1235/tcp
[2023-09-01 15:57:25.936] [spectator] [info] Publishing metrics with the following common tags: {}
[2023-09-01 15:57:25.936] [spectatord] [info] Starting janitorial tasks
[2023-09-01 15:57:25.936] [spectatord] [info] Using receive buffer size = 16777216
[2023-09-01 15:57:25.936] [spectatord] [info] Starting spectatord server on port 1235/udp
[2023-09-01 15:57:25.936] [spectatord] [info] statsd support is not enabled
[2023-09-01 15:57:25.936] [spectatord] [info] Starting local server (dgram) on socket /run/spectatord/spectatord.unix
[2023-09-01 15:57:31.639] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:31.639] [spectator] [info] HTTP timeout to http://example.com:81: 703ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:32.341] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:32.341] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:33.042] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 2.108s
[2023-09-01 15:57:36.636] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:36.636] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:37.338] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:37.338] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:38.040] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [info] Sent: 0 Dropped: 25 Total: 25. Elapsed 2.105s
Two more instances have since demonstrated this problem, even though we cannot recreate it locally.
The paired error log lines, as reported by sudo journalctl -u spectatord | grep -i timeout, look like:
Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [error] Failed to POST https://HOSTNAME/api/v4/update: Timeout was reached
Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [info] HTTP timeout to https://HOSTNAME/api/v4/update: 1597ms elapsed - connect_to=1000 read_to=2000
The slight difference here versus the repro attempt is that the actual connect duration (1597ms) exceeds the connect timeout (connect_to=1000), whereas in the repro the requests failed after ~700ms, well under the timeout.
In one of the cases, the CLOSE_WAIT sockets appear to be using IP addresses that are no longer in use by either the load balancer or the aggregator clusters, which is strange. The sockets are clearly owned by the SpectatorD process, as reported by sudo netstat -tonp | grep CLOSE_WAIT.
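As a cross-check that does not depend on netstat, the same count can be pulled straight from the kernel socket tables on Linux. This is a diagnostic sketch, not part of spectatord; state code 08 in /proc/net/tcp is CLOSE_WAIT:

```shell
# Count sockets in CLOSE_WAIT: field 4 of /proc/net/tcp{,6} is the TCP state,
# and 08 is the hex code for CLOSE_WAIT. Linux-only; tcp6 may be absent.
awk '$4 == "08"' /proc/net/tcp /proc/net/tcp6 2>/dev/null | wc -l
```

A steadily growing count here, with no matching close from the owning process, is the signature of an application that received the peer's FIN but never called close() on the socket.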
Some commentary on cleaning up half-closed connections in libcurl:

- CURLOPT_MAXAGE_CONN may help, by preventing re-use of old connections, although the default is 118 seconds.
- CURLOPT_FORBID_REUSE triggers explicit close during perform().
- Otherwise, pooled connections are closed during cleanup().

A few attempts to dig into this further:

- Adding a curl_easy_cleanup() call after http_client.cc#L253 triggers failures in the test suite:

[2023-09-21 21:25:51.705] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.705] [spectator] [info] HTTP timeout to http://localhost:2345: 8ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.707] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.707] [spectator] [info] HTTP timeout to http://localhost:2345: 1ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.708] [spectator] [info] HTTP timeout to http://localhost:2345: 0ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 0.013s
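To make the CURLOPT_MAXAGE_CONN semantics above concrete, here is a small self-contained sketch in plain C++ of the reuse policy that option controls: a pooled connection older than the configured max age is dropped instead of being reused. The names are illustrative, not libcurl internals.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// One cached connection, remembered by creation time.
struct PooledConn {
  Clock::time_point created;
};

// Sketch of a connection pool with a max-age reuse policy, mirroring
// what CURLOPT_MAXAGE_CONN (default 118s) does inside libcurl.
class ConnPool {
 public:
  explicit ConnPool(std::chrono::seconds max_age) : max_age_(max_age) {}

  // Cache a connection to `host`, created at time `now`.
  void put(const std::string& host, Clock::time_point now) {
    pool_[host] = PooledConn{now};
  }

  // Returns true if a cached connection to `host` is fresh enough to reuse.
  // A connection past its max age is evicted rather than handed back.
  bool reusable(const std::string& host, Clock::time_point now) {
    auto it = pool_.find(host);
    if (it == pool_.end()) return false;
    if (now - it->second.created > max_age_) {
      pool_.erase(it);  // too old: drop instead of reusing
      return false;
    }
    return true;
  }

 private:
  std::chrono::seconds max_age_;
  std::unordered_map<std::string, PooledConn> pool_;
};
```

With the default of 118 seconds, a stale connection, e.g. to an aggregator IP that has since left the load balancer, can still be picked for reuse for almost two minutes; lowering the max age or setting CURLOPT_FORBID_REUSE shrinks that window.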
The solution for this is to implement proper curl-multi support, tracked in https://github.com/Netflix-Skunkworks/spectatord/issues/51.
We saw an issue where one node had thousands of sockets in a CLOSE_WAIT state, created by the latest release version of spectatord. This particular node was in a state where networking reliability was affected - SSH connections were flaky and there were many timeouts when spectatord was attempting to POST to the Atlas backend.

It looks like spectatord is not cleaning up sockets after a timeout is reached. This bug has probably been around for a while.