anycable / anycable-go

AnyCable real-time server
https://anycable.io

Metrics anycable_go_clients_num and anycable_go_clients_uniq_num counts are not updated correctly #160

Closed: mabrikan closed this issue 1 year ago

mabrikan commented 1 year ago

Tell us about your environment

AnyCable-Go version: 1.2.0

AnyCable gem version: 1.3

What did you do?

We had an incident where the disconnect queue size became very large and didn't decrease. As part of debugging, we disabled all connections to the anycable-go pods in k8s:

  1. We removed the ingress resource that creates the ALB in AWS (we use alb-ingress-controller).
  2. We removed the service that exposes anycable-go.

Still, after doing all this, the metrics anycable_go_clients_num and anycable_go_clients_uniq_num keep reporting data, and the number of clients is huge (anycable_go_clients_num=5326 and anycable_go_clients_uniq_num=2512).

I don't know if this is an instrumentation bug or if the app really is stuck in some way. It is also worth noting that the goroutine count is high as well: anycable_go_goroutines_num=6983.

What did you expect to happen?

Number of clients should be 0.

What actually happened?

Number of clients did not change.

palkan commented 1 year ago

> It is also worth noting that the goroutine count is high as well: anycable_go_goroutines_num=6983.

The number of goroutines is O(N) in the number of connections. That means the connections are still "alive" from the anycable-go server's perspective (and thus, the metrics look correct).
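As an illustration only (this is not anycable-go's actual code), a typical Go socket server keeps at least one goroutine blocked on a read per open connection, so the goroutine count grows linearly with the number of connections the server still considers open:

```go
// Illustrative sketch: one goroutine per accepted connection, so the
// goroutine count is O(N) in the number of open connections.
package main

import (
	"bufio"
	"log"
	"net"
)

func handle(conn net.Conn) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	for {
		// Blocks until the peer sends data or the connection is closed.
		// If the peer silently disappears, this read can block for a long
		// time unless TCP keepalive or an application-level ping fires.
		if _, err := r.ReadBytes('\n'); err != nil {
			return
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		go handle(conn) // one goroutine per connection
	}
}
```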

You turned off the traffic, but the ingress (load balancer) itself was still running, right? It could still keep connections to the backend (anycable-go) open, thus making them look alive.

For how long have you been monitoring anycable-go after dropping the services? Another hypothesis is that the connections got stuck in a half-closed state (due to a non-client-initiated disconnect), and they could stay in this state for dozens of minutes (depending on the OS TCP settings, see https://docs.anycable.io/anycable-go/os_tuning?id=tcp-keepalive).
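For reference, when a socket relies on the system-wide keepalive settings, the kernel only gives up on a silent peer after roughly tcp_keepalive_time + tcp_keepalive_probes × tcp_keepalive_intvl seconds, which is over two hours with stock Linux defaults and drops to minutes only if those sysctls are tuned down.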

mabrikan commented 1 year ago

Hi @palkan, and thank you for your response.

I just want to clarify that we physically deleted the ALB to make sure there was no traffic going to the affected instance; that is why we were surprised to see the number of clients not changing.

After deleting the ALB, we monitored the instance for 24 hours and no change happened.

> Another hypothesis is that the connections got stuck in a half-closed state (due to a non-client-initiated disconnect), and they could stay in this state for dozens of minutes (depending on the OS TCP settings, see https://docs.anycable.io/anycable-go/os_tuning?id=tcp-keepalive).

Inspecting the OS, here are the variables:

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

Correct me if I'm wrong, but according to this (7200 s + 9 × 75 s = 7875 s, i.e. roughly 2 hours and 11 minutes), the number of clients should drop, which was not the case in our monitoring.

palkan commented 1 year ago

> the number of clients should drop, which was not the case in our monitoring.

Yeah, it should.

Just to confirm: do you see fresh data points in the monitoring system reflecting the huge numbers of goroutines/connections (not some interpolation)?

The goroutines_num value is updated periodically and is highly unlikely to show stale system information. So, something is going on at the connection handling layer. Do you have any logs from the time you shut down the load balancer?

mabrikan commented 1 year ago

No, unfortunately we don't have the logs.

What we did was implement a monkey-patch using a keepalive probe in the anycable-go pod.

The probe opens a WebSocket connection to localhost with a specific timeout.

If the timeout is reached (i.e. most probably disconnect_queue and goroutines_num are high), the pod will restart.
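Roughly, the probe looks something like the sketch below. This is illustrative only: the /cable path, port 8080, and the 5-second timeout are assumptions based on anycable-go defaults, and gorilla/websocket is just one way to dial.

```go
// ws_probe.go: exits non-zero if the local anycable-go instance cannot
// complete a WebSocket handshake within the timeout. Intended to be run
// as an exec liveness probe so kubelet restarts the pod when it is stuck.
package main

import (
	"log"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	dialer := websocket.Dialer{HandshakeTimeout: 5 * time.Second}
	conn, _, err := dialer.Dial("ws://localhost:8080/cable", nil)
	if err != nil {
		log.Printf("probe failed: %v", err)
		os.Exit(1) // non-zero exit marks the container unhealthy
	}
	conn.Close()
}
```

Wired into the pod spec as an exec liveness probe, a failed or timed-out handshake causes kubelet to restart the container.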

Not a perfect solution, but it gets the job done 😄

palkan commented 1 year ago

Closing as stale