Also, it is worth noting that the goroutines value is also high: anycable_go_goroutines_num=6983.
The number of goroutines is O(N) in the number of connections. That means the connections are still "alive" from the anycable-go server's perspective (and thus the metrics look correct).
You turned off the traffic, but the ingress (load balancer) itself was still running, right? It could still keep connections to the backend (anycable-go), thus making them look alive.
How long did you monitor anycable-go after dropping the services? Another hypothesis is that the connections got stuck in a half-closed state (due to a non-client-initiated disconnect), and they could stay in this state for dozens of minutes (depending on the OS TCP settings, see https://docs.anycable.io/anycable-go/os_tuning?id=tcp-keepalive).
Hi @palkan, and thank you for your response.
To clarify: we physically deleted the ALB to make sure no traffic was going to the affected instance, which is why we were surprised to see the number of clients not changing.
After deleting the ALB, we monitored the instance for 24 hours and saw no change.
Another hypothesis is that the connections got stuck in a half-closed state (due to a non-client-initiated disconnect), and they could stay in this state for dozens of minutes (depending on the OS TCP settings, see https://docs.anycable.io/anycable-go/os_tuning?id=tcp-keepalive).
Inspecting the OS, here are the variables:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
Correct me if I'm wrong, but according to this, after ~2 hours and ~11 minutes (7200 s of idle time plus 9 probes × 75 s ≈ 7875 s) the number of clients should drop, which was not the case from our monitoring.
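As a back-of-the-envelope check (assuming SO_KEEPALIVE is actually enabled on those sockets, which is what makes the kernel settings above apply):

```go
// How the sysctl values above translate into the time before the kernel
// resets a connection to a dead peer.
package main

import "fmt"

func main() {
	const (
		keepaliveTime   = 7200 // net.ipv4.tcp_keepalive_time: idle seconds before probing starts
		keepaliveIntvl  = 75   // net.ipv4.tcp_keepalive_intvl: seconds between probes
		keepaliveProbes = 9    // net.ipv4.tcp_keepalive_probes: unanswered probes before reset
	)
	total := keepaliveTime + keepaliveIntvl*keepaliveProbes
	fmt.Printf("dead connection reset after ~%d s (~%d min)\n", total, total/60)
	// Prints: dead connection reset after ~7875 s (~131 min)
}
```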
the number of clients should drop, which was not the case from our monitoring.
Yeah, it should.
Just to confirm: you see fresh data points in the monitoring system reflecting the huge numbers of goroutines/connections (not some interpolation)?
The goroutines_num value is updated periodically and is highly unlikely to show stale system information. So, something is happening at the connection handling layer. Do you have any logs from the time you shut down the load balancer?
No, unfortunately we don't have the logs.
What we did instead is implement a monkey-patch: a keepAlive probe in the anycable-go pod. The probe opens a WebSocket connection to localhost with a specific timeout. If the timeout is reached (i.e., most probably disconnect_queue and goroutines_num are high), the pod restarts.
Not a perfect solution, but it gets the job done 😄
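For reference, a minimal sketch of what such a probe binary could look like, using the gorilla/websocket client. The address ws://localhost:8080/cable and the 5-second timeout are assumptions for illustration, not values from our setup:

```go
// probe.go: liveness check for a possibly stuck anycable-go instance.
// Endpoint and timeout are assumed; adjust to the actual deployment.
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/gorilla/websocket"
)

func main() {
	dialer := websocket.Dialer{HandshakeTimeout: 5 * time.Second}

	// If the server is stuck (disconnect_queue / goroutines_num growing),
	// the handshake never completes and the dial times out.
	conn, _, err := dialer.Dial("ws://localhost:8080/cable", nil)
	if err != nil {
		fmt.Fprintf(os.Stderr, "probe failed: %v\n", err)
		os.Exit(1) // non-zero exit => Kubernetes restarts the pod
	}
	defer conn.Close()

	fmt.Println("probe ok")
}
```

Wired up as an exec livenessProbe, the non-zero exit code is what triggers the pod restart.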
Closing as stale
Tell us about your environment
AnyCable-Go version: 1.2.0
AnyCable gem version: 1.3
What did you do?
We had an incident where the disconnect queue size became very large and didn't decrease. As part of debugging, we disabled all connections to the anycable-go pods in k8s. Still, after doing all this, the metrics anycable_go_clients_num and anycable_go_clients_uniq_num are showing data, and the number of clients is huge (anycable_go_clients_num=5326 and anycable_go_clients_uniq_num=2512). I don't know if this is an instrumentation bug or if the app really is stuck in some way. Also, it is worth noting that the goroutines value is also high: anycable_go_goroutines_num=6983.
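For completeness, here is a minimal sketch of how such a check could look, assuming the Prometheus metrics endpoint is enabled and exposed at localhost:8080/metrics (endpoint, port, and path are assumptions; adjust to your configuration):

```go
// Fetch anycable-go's Prometheus metrics and print the gauges in question.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		log.Fatalf("metrics endpoint unreachable: %v", err)
	}
	defer resp.Body.Close()

	// Keep only the connection-related gauges mentioned above.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "anycable_go_clients_num") ||
			strings.HasPrefix(line, "anycable_go_clients_uniq_num") ||
			strings.HasPrefix(line, "anycable_go_goroutines_num") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```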
What did you expect to happen?
Number of clients should be 0.
What actually happened?
Number of clients did not change.