Netflix-Skunkworks / spectatord

A high performance metrics daemon
Apache License 2.0
24 stars 5 forks source link

SpectatorD Does Not Clean Up Sockets After Timeouts #68

Closed copperlight closed 2 months ago

copperlight commented 1 year ago

We saw an issue where one node had thousands of sockets in a CLOSE_WAIT state, created by the latest release version of spectatord. This particular node was in a state where networking reliability was affected - SSH connections were flaky and there were many timeouts when spectatord was attempting to POST to the Atlas backend.

[2023-08-30 14:20:55.101] [spectator] [error] Failed to POST https://$HOST/api/v4/update: Timeout was reached
[2023-08-30 14:20:55.101] [spectator] [info] HTTP timeout to https://$HOST/api/v4/update: 1000ms elapsed - connect_to=1000 read_to=2000

It looks like spectatord is not cleaning up sockets after a timeout is reached.

This bug has probably been around for awhile.

copperlight commented 1 year ago

I cannot reproduce this condition on an Ubuntu Jammy system with a secondary spectatord process pointing to an invalid URL, so that connect timeouts are triggered.

sudo /usr/local/bin/spectatord --admin_port 1235 --enable_socket false --port 1235 --uri http://example.com:81
[2023-09-01 15:57:25.936] [spectator] [info] Starting to send metrics to http://example.com:81 every 5s.
[2023-09-01 15:57:25.936] [spectatord] [info] Starting admin server on port 1235/tcp
[2023-09-01 15:57:25.936] [spectator] [info] Publishing metrics with the following common tags: {}
[2023-09-01 15:57:25.936] [spectatord] [info] Starting janitorial tasks
[2023-09-01 15:57:25.936] [spectatord] [info] Using receive buffer size = 16777216
[2023-09-01 15:57:25.936] [spectatord] [info] Starting spectatord server on port 1235/udp
[2023-09-01 15:57:25.936] [spectatord] [info] statsd support is not enabled
[2023-09-01 15:57:25.936] [spectatord] [info] Starting local server (dgram) on socket /run/spectatord/spectatord.unix
[2023-09-01 15:57:31.639] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:31.639] [spectator] [info] HTTP timeout to http://example.com:81: 703ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:32.341] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:32.341] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:33.042] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 2.108s
[2023-09-01 15:57:36.636] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:36.636] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:37.338] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:37.338] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:38.040] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [info] Sent: 0 Dropped: 25 Total: 25. Elapsed 2.105s
copperlight commented 12 months ago

We had another two instances demonstrate this problem again, even if we cannot recreate it locally.

The error pair log lines look like, as reported by sudo journalctl -u spectatord |grep -i timeout:

Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [error] Failed to POST https://HOSTNAME/api/v4/update: Timeout was reached
Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [info] HTTP timeout to https://HOSTNAME/api/v4/update: 1597ms elapsed - connect_to=1000 read_to=2000

The slight difference here versus the repro attempt is that the actual connect duration is longer than the timeout, versus... not.

In one of the cases, the CLOSE_WAIT sockets appear to be using IP addresses that are no longer in use by either the load balancer or the aggregator clusters, which is strange. The sockets are clearly owned by the SpectatorD process, as reported by sudo netstat -tonp |grep CLOSE_WAIT.

Some commentary on cleaning up half-closed connections in libcurl:

copperlight commented 11 months ago

A few attempts to dig into this further:

[2023-09-21 21:25:51.705] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.705] [spectator] [info] HTTP timeout to http://localhost:2345: 8ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.707] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.707] [spectator] [info] HTTP timeout to http://localhost:2345: 1ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.708] [spectator] [info] HTTP timeout to http://localhost:2345: 0ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 0.013s
copperlight commented 2 months ago

The solution for this is to implement proper curl-multi in https://github.com/Netflix-Skunkworks/spectatord/issues/51.