What would you like to be added:
Implement a prober that periodically calls the /healthz endpoint of the API server over the same connection until that connection is cleanly closed, is reset, or becomes a black hole, i.e. the server simply stops responding. The prober should expose a counter metric that records how often a connection became a black hole.
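To make the request more concrete, here is a rough sketch of what such a prober could look like. It is only an illustration under several assumptions: the `Prober` type, its fields, and `runDemo` are hypothetical names, a plain atomic counter stands in for the real metric (an actual implementation would probably register a Prometheus counter), the connection is plain HTTP/1.1 rather than TLS to a real API server, and "black hole" is approximated as "no response within a per-probe timeout". Connection reuse is checked with `net/http/httptrace`, because the Go transport silently dials a new connection when an idle one was closed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"sync/atomic"
	"time"
)

// Prober is a hypothetical type; none of these names come from an existing API.
type Prober struct {
	Client     *http.Client
	BaseURL    string
	Interval   time.Duration // pause between probes
	Timeout    time.Duration // per-probe deadline; exceeding it counts as a black hole
	BlackHoles atomic.Int64  // stand-in for the requested counter metric
}

// probe sends one GET /healthz and reports whether an existing connection was reused.
func (p *Prober) probe() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), p.Timeout)
	defer cancel()
	var reused atomic.Bool
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) { reused.Store(info.Reused) },
	}
	ctx = httptrace.WithClientTrace(ctx, trace)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, p.BaseURL+"/healthz", nil)
	if err != nil {
		return false, err
	}
	resp, err := p.Client.Do(req)
	if err != nil {
		return reused.Load(), err
	}
	defer resp.Body.Close()
	// Drain the body so the transport can keep the connection for the next probe.
	_, err = io.Copy(io.Discard, resp.Body)
	return reused.Load(), err
}

// Run probes until the connection breaks or maxProbes is reached.
func (p *Prober) Run(maxProbes int) error {
	for i := 0; i < maxProbes; i++ {
		reused, err := p.probe()
		switch {
		case errors.Is(err, context.DeadlineExceeded):
			p.BlackHoles.Add(1) // no answer within Timeout: treat as a black hole
			return err
		case err != nil:
			return err // clean close, reset, or another transport error
		case i > 0 && !reused:
			return errors.New("connection was replaced between probes")
		}
		time.Sleep(p.Interval)
	}
	return nil
}

// runDemo probes a local test server that answers twice and then stops
// responding, and returns the resulting black hole count.
func runDemo() int64 {
	var calls atomic.Int64
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) > 2 {
			<-release // simulate a black hole: keep the connection open, send nothing
			return
		}
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close so the stuck handler can return

	p := &Prober{
		Client:   &http.Client{Transport: &http.Transport{MaxConnsPerHost: 1}},
		BaseURL:  srv.URL,
		Interval: 10 * time.Millisecond,
		Timeout:  200 * time.Millisecond,
	}
	_ = p.Run(10)
	return p.BlackHoles.Load()
}

func main() {
	fmt.Println("black holes:", runDemo())
}
```

In this sketch a probe that hits the timeout increments the counter, while a clean close or reset ends the loop without counting. A real prober would presumably open a fresh connection afterwards and keep going, and would need to handle TLS and HTTP/2, which this sketch leaves out.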
Why is this needed:
Currently we have no monitoring for broken long-running TCP connections. We observed that client-go and Go's net/http#Transport reuse TCP connections by default, and if a connection becomes a black hole, it takes minutes for clients to detect this and reconnect. If long-running connections (e.g. watch) can break in a similar way, that could cause subtle issues that are probably hard to debug. Metrics about broken connections could help in this case.