Graceful handling of metrics server unavailability

Is your feature request related to a problem? Please describe

The metrics server plays a crucial role in providing resource usage data for nodes and pods. However, when it becomes unavailable or experiences issues, the failure can cascade to every component in the cluster that depends on that data.

Describe the solution you'd like

Given a cluster where the metrics server becomes unavailable or unresponsive
When components or tools attempt to fetch metrics data
Then Runtime should...
- Detect the unavailability of the metrics server
- Log appropriate warnings about the metrics server being unreachable
- Clearly communicate the unavailability to users and administrators
- Offer clear error messages and instructions for troubleshooting metrics server issues
- Other/More??

Describe alternatives you've considered

- Redundant metrics servers: Deploy multiple instances of the metrics server for high availability. While this could mitigate some issues, it doesn't address the need for graceful degradation if all instances fail.
- Local node-level metrics collection: Implement a secondary, simplified metrics collection system at the node level that can provide basic metrics when the central metrics server is unavailable. This could be complex to implement and maintain.

Additional context

This feature would be particularly valuable for production environments where high availability is critical. It would improve the reliability of autoscaling, resource management, and monitoring solutions that depend on the metrics server. Additionally, it would simplify troubleshooting and reduce false alarms during metrics server maintenance or unexpected outages.

defenseunicorns / uds-runtime