Is your feature request related to a problem? Please describe
The metrics server plays a crucial role in providing resource usage data for nodes and pods. When it becomes unavailable or unresponsive, the failure can cascade across the cluster, affecting the autoscaling, resource management, and monitoring solutions that depend on it.
Describe the solution you'd like
Given a cluster where the metrics server becomes unavailable or unresponsive
When components or tools attempt to fetch metrics data
Then Runtime should...
Detect the unavailability of the metrics server
Log appropriate warnings about the metrics server being unreachable
Clearly communicate the unavailability to users and administrators
Offer clear error messages and instructions for troubleshooting metrics server issues
Other behaviors still to be defined
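
To make the requested behavior concrete, here is a minimal sketch of the detect, warn, and report steps listed above. It is only an illustration, not a proposal for how Runtime should implement it. `check_metrics_server`, `MetricsStatus`, and the injected `fetch` callable are hypothetical names; the API path is the standard metrics.k8s.io endpoint, and the `kubectl` hints assume a default metrics-server install in `kube-system`.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("metrics_probe")

# Aggregated API path served by metrics-server.
METRICS_API_PATH = "/apis/metrics.k8s.io/v1beta1/nodes"

# Troubleshooting hints shown to users. These assume the default
# metrics-server manifest (kube-system namespace, k8s-app label).
TROUBLESHOOTING_HINTS = [
    "kubectl get apiservice v1beta1.metrics.k8s.io",
    "kubectl -n kube-system get pods -l k8s-app=metrics-server",
    "kubectl -n kube-system logs deployment/metrics-server",
]


@dataclass
class MetricsStatus:
    """Hypothetical result type surfaced to users and administrators."""
    available: bool
    message: str
    hints: List[str] = field(default_factory=list)


def check_metrics_server(fetch: Callable[[str], int]) -> MetricsStatus:
    """Probe the metrics API and report availability.

    `fetch` is a hypothetical callable that requests the given API path
    and returns the HTTP status code, or raises OSError when the
    connection fails or times out.
    """
    try:
        status_code = fetch(METRICS_API_PATH)
    except OSError as exc:
        reason = f"request failed: {exc}"
    else:
        if 200 <= status_code < 300:
            return MetricsStatus(True, "metrics server is reachable")
        reason = f"unexpected HTTP status {status_code}"

    # Log a warning and return a clear message plus next steps.
    message = f"metrics server unavailable ({reason})"
    logger.warning(message)
    return MetricsStatus(False, message, list(TROUBLESHOOTING_HINTS))
```

A caller would pass a `fetch` backed by its own Kubernetes client and show `message` and `hints` to the user when `available` is false, instead of failing with a generic error.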
Describe alternatives you've considered
Redundant metrics servers: Deploy multiple instances of the metrics server for high availability. While this could mitigate some issues, it doesn't address the need for graceful degradation if all instances fail.
Local node-level metrics collection: Implement a secondary, simplified metrics collection system at the node level that can provide basic metrics when the central metrics server is unavailable. This could be complex to implement and maintain.
Additional context
This feature would be particularly valuable for production environments where high availability is critical. It would improve the reliability of autoscaling, resource management, and monitoring solutions that depend on the metrics server. Additionally, it would simplify troubleshooting and reduce false alarms during metrics server maintenance or unexpected outages.