gardener / monitoring

Components needed for Gardener monitoring
Apache License 2.0

Add monitoring for broken long running connections #18

Open istvanballok opened 2 years ago

istvanballok commented 2 years ago

What would you like to be added:

Implement a prober that periodically calls the /healthz endpoint of the API server over the same connection until that connection is cleanly closed, is reset, or becomes a black hole (the server simply stops responding). The prober should expose a counter metric that records how often a connection became a black hole.

Why is this needed:

Currently we have no monitoring for broken long-running TCP connections. We observed that client-go and golang net/http#Transport reuse TCP connections by default, and that if a connection becomes a black hole, it takes minutes for the clients to detect this and reconnect. If long-running connections (e.g. watch) can break in a similar way, that could cause subtle issues that are probably hard to debug. Metrics about broken connections would help in this case.