SpectatorD Does Not Clean Up Sockets After Timeouts

copperlight commented 1 year ago

We saw an issue where one node had thousands of sockets in a CLOSE_WAIT state, created by the latest release version of spectatord. This particular node was in a state where networking reliability was affected - SSH connections were flaky and there were many timeouts when spectatord was attempting to POST to the Atlas backend.

[2023-08-30 14:20:55.101] [spectator] [error] Failed to POST https://$HOST/api/v4/update: Timeout was reached
[2023-08-30 14:20:55.101] [spectator] [info] HTTP timeout to https://$HOST/api/v4/update: 1000ms elapsed - connect_to=1000 read_to=2000

It looks like spectatord is not cleaning up sockets after a timeout is reached.

This bug has probably been around for awhile.

copperlight commented 1 year ago

I cannot reproduce this condition on an Ubuntu Jammy system with a secondary spectatord process pointing to an invalid URL, so that connect timeouts are triggered.

sudo /usr/local/bin/spectatord --admin_port 1235 --enable_socket false --port 1235 --uri http://example.com:81

[2023-09-01 15:57:25.936] [spectator] [info] Starting to send metrics to http://example.com:81 every 5s.
[2023-09-01 15:57:25.936] [spectatord] [info] Starting admin server on port 1235/tcp
[2023-09-01 15:57:25.936] [spectator] [info] Publishing metrics with the following common tags: {}
[2023-09-01 15:57:25.936] [spectatord] [info] Starting janitorial tasks
[2023-09-01 15:57:25.936] [spectatord] [info] Using receive buffer size = 16777216
[2023-09-01 15:57:25.936] [spectatord] [info] Starting spectatord server on port 1235/udp
[2023-09-01 15:57:25.936] [spectatord] [info] statsd support is not enabled
[2023-09-01 15:57:25.936] [spectatord] [info] Starting local server (dgram) on socket /run/spectatord/spectatord.unix
[2023-09-01 15:57:31.639] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:31.639] [spectator] [info] HTTP timeout to http://example.com:81: 703ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:32.341] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:32.341] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:33.042] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 2.108s
[2023-09-01 15:57:36.636] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:36.636] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:37.338] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:37.338] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:38.040] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [info] Sent: 0 Dropped: 25 Total: 25. Elapsed 2.105s

copperlight commented 12 months ago

We had another two instances demonstrate this problem again, even if we cannot recreate it locally.

The error pair log lines look like, as reported by sudo journalctl -u spectatord |grep -i timeout:

Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [error] Failed to POST https://HOSTNAME/api/v4/update: Timeout was reached
Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [info] HTTP timeout to https://HOSTNAME/api/v4/update: 1597ms elapsed - connect_to=1000 read_to=2000

The slight difference here versus the repro attempt is that the actual connect duration is longer than the timeout, versus... not.

In one of the cases, the CLOSE_WAIT sockets appear to be using IP addresses that are no longer in use by either the load balancer or the aggregator clusters, which is strange. The sockets are clearly owned by the SpectatorD process, as reported by sudo netstat -tonp |grep CLOSE_WAIT.

Some commentary on cleaning up half-closed connections in libcurl:

https://curl-library.cool.haxx.narkive.com/8hWrXbx7/cleanup-half-closed-sockets-in-connection-cache
https://curl.se/mail/lib-2005-04/0064.html
https://curl.se/mail/tracker-2011-05/0006.html
https://stackoverflow.com/questions/70657452/libcurl-checking-the-server-shutdown-connection
- Setting a lower CURLOPT_MAXAGE_CONN may help, by preventing re-use of old connections, although the default is 118 seconds.
https://sourceforge.net/p/curl/bugs/558/
- Setting CURLOPT_FORBID_REUSE triggers explicit close during perform().
- Connections are also closed during cleanup().

copperlight commented 11 months ago

A few attempts to dig into this further:

Placing a curl_easy_cleanup() call after http_client.cc#L253 triggers failures in the test suite.
Connecting to a simple Python socket server which prematurely closes connections does not increase the number of CLOSE_WAIT sockets, and yields the following log messages:

[2023-09-21 21:25:51.705] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.705] [spectator] [info] HTTP timeout to http://localhost:2345: 8ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.707] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.707] [spectator] [info] HTTP timeout to http://localhost:2345: 1ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.708] [spectator] [info] HTTP timeout to http://localhost:2345: 0ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 0.013s

copperlight commented 2 months ago

The solution for this is to implement proper curl-multi in https://github.com/Netflix-Skunkworks/spectatord/issues/51.

Netflix-Skunkworks / spectatord

SpectatorD Does Not Clean Up Sockets After Timeouts #68