Monitor HttpClient timeouts and force restart of container on reaching threshold

We see instances where a container gets stuck at 100% CPU on one of its cores. After investigating, it appears that the HttpServer object is spinning and is therefore unable to handle requests adequately. As a result, requests to the node time out.

To mitigate this issue we could:

- Add code to the EndpointProxy that reports each HttpClient timeout to the orchestrator, so that the orchestrator can keep metrics on them.
- Monitor those orchestrator metrics and restart the container on the node when the problem is detected.
- Have the health endpoint on the node try to detect this issue directly.
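
As a rough illustration of the first two options, the orchestrator-side bookkeeping could be a per-node sliding-window counter. This is only a sketch under assumptions: the class name `TimeoutMonitor`, its methods, and the default threshold and window values are hypothetical and do not correspond to existing SlideRule code, and the actual restart mechanism is left out.

```python
from collections import defaultdict, deque


class TimeoutMonitor:
    """Hypothetical helper: tracks recent HttpClient timeouts per node."""

    def __init__(self, threshold=5, window_seconds=60.0):
        # Assumed values; real limits would need tuning against production data.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # node id -> timestamps of timeouts

    def record_timeout(self, node, now):
        """Record one timeout reported for `node` at time `now` (seconds)."""
        self._events[node].append(now)

    def needs_restart(self, node, now):
        """Return True if `node` has reached the threshold within the window."""
        events = self._events[node]
        # Drop timeouts that have aged out of the sliding window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Example: three timeouts within ten seconds trip the threshold.
monitor = TimeoutMonitor(threshold=3, window_seconds=10.0)
for t in (0.0, 1.0, 2.0):
    monitor.record_timeout("node-a", t)
restart_now = monitor.needs_restart("node-a", 2.0)
restart_later = monitor.needs_restart("node-a", 20.0)
```

A time window rather than a lifetime count keeps occasional, unrelated timeouts from accumulating into a restart, while a burst from a spinning HttpServer would still trip the threshold quickly.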

SlideRuleEarth/sliderule issue #320