defenseunicorns / uds-runtime

UDS Runtime API & UI
GNU Affero General Public License v3.0

Graceful handling of metrics server unavailability #377

Open adam-defenseunicorns opened 2 weeks ago

adam-defenseunicorns commented 2 weeks ago

Is your feature request related to a problem? Please describe

The metrics server plays a crucial role in providing resource usage data for nodes and pods. When it becomes unavailable or experiences issues, anything that depends on that data is affected, which can lead to cascading problems across the cluster.

Describe the solution you'd like

Describe alternatives you've considered

  1. Redundant metrics servers: Deploy multiple instances of the metrics server for high availability. While this could mitigate some issues, it doesn't address the need for graceful degradation if all instances fail.
  2. Local node-level metrics collection: Implement a secondary, simplified metrics collection system at the node level that can provide basic metrics when the central metrics server is unavailable. This could be complex to implement and maintain.
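For discussion, the graceful degradation that alternative 1 says is still needed could look roughly like the sketch below: keep the last successful sample and, when a fetch fails, return it marked as unavailable so callers can show stale data or a placeholder instead of an error. This is a minimal sketch and not the UDS Runtime implementation. `NodeUsage`, `Source`, `Snapshot`, `Degrader`, and `ErrUnavailable` are hypothetical names, and `Source` stands in for whatever actually queries the metrics server.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrUnavailable is a hypothetical sentinel a Source can return when the
// metrics server cannot be reached.
var ErrUnavailable = errors.New("metrics server unavailable")

// NodeUsage is a hypothetical, simplified usage sample for one node.
type NodeUsage struct {
	CPUMilli int64
	MemBytes int64
}

// Source is a hypothetical stand-in for the call that queries the metrics server.
type Source func() (map[string]NodeUsage, error)

// Snapshot is what callers receive, whether or not the latest fetch worked.
type Snapshot struct {
	Usage     map[string]NodeUsage // last known usage; nil if no fetch ever succeeded
	Available bool                 // false when the most recent fetch failed
	FetchedAt time.Time            // time of the last successful fetch; zero if none
}

// Degrader wraps a Source and remembers the last successful snapshot.
type Degrader struct {
	mu    sync.Mutex
	fetch Source
	last  Snapshot
}

// NewDegrader returns a Degrader with no cached data.
func NewDegrader(fetch Source) *Degrader {
	return &Degrader{fetch: fetch}
}

// Get never returns an error. On failure it returns the cached snapshot
// with Available set to false so the caller can degrade gracefully.
func (d *Degrader) Get() Snapshot {
	d.mu.Lock()
	defer d.mu.Unlock()

	usage, err := d.fetch()
	if err != nil {
		stale := d.last
		stale.Available = false
		return stale
	}
	d.last = Snapshot{Usage: usage, Available: true, FetchedAt: time.Now()}
	return d.last
}

func main() {
	calls := 0
	// Fake source for illustration: the second call simulates an outage.
	d := NewDegrader(func() (map[string]NodeUsage, error) {
		calls++
		if calls == 2 {
			return nil, ErrUnavailable
		}
		return map[string]NodeUsage{"node-a": {CPUMilli: 250, MemBytes: 1 << 30}}, nil
	})

	for i := 0; i < 3; i++ {
		s := d.Get()
		fmt.Println("available:", s.Available, "nodes:", len(s.Usage))
	}
}
```

A consumer such as a UI could check `Available` and `FetchedAt` to decide whether to display the stale values, a "metrics unavailable" notice, or both. How long stale data should be considered usable is left open here.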

Additional context

This feature would be particularly valuable for production environments where high availability is critical. It would improve the reliability of autoscaling, resource management, and monitoring solutions that depend on the metrics server. Additionally, it would simplify troubleshooting and reduce false alarms during metrics server maintenance or unexpected outages.

decleaver commented 1 day ago

API changes implemented and merged. Waiting on the final design before starting UI work.