aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Introduce Metrics Serving in EKS Anywhere #7875

Closed by jiayiwang7 4 months ago

jiayiwang7 commented 8 months ago

What would you like to be added:

Introduce options to securely serve metrics from K8s system components and EKS-A management components.

Why is this needed:

As an EKS Anywhere cluster administrator, I would like to scrape metrics from the K8s system and EKS-A management components in a simple but secure way. Those metrics are useful for building dashboards and alerts and for monitoring the health of a cluster.

Currently in EKS-A, metrics of some system components are already exposed by default (e.g. coredns, kube-apiserver). Other system and management components, such as kube-controller-manager, are configured with the default --bind-address=127.0.0.1 or equivalent, so these servers listen only on localhost. The goal is to expose those metrics securely so that external monitoring services such as Prometheus can consume them.
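
To make the target concrete: from the scraper's side, consuming metrics from a secure port amounts to an HTTPS GET on /metrics with a bearer token. The sketch below only builds such a request and does not send it. It assumes the component has been reconfigured to listen on a reachable address, uses the upstream default secure ports of kube-controller-manager (10257) and kube-scheduler (10259), and treats the node IP and token as placeholders. The function name is hypothetical.

```python
import urllib.request

# Upstream default secure ports of the two control-plane components.
SECURE_PORTS = {"kube-controller-manager": 10257, "kube-scheduler": 10259}


def build_scrape_request(node_ip: str, component: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET for a component's /metrics."""
    url = f"https://{node_ip}:{SECURE_PORTS[component]}/metrics"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}, method="GET"
    )


if __name__ == "__main__":
    # Placeholder node IP and token, for illustration only.
    req = build_scrape_request("10.0.0.10", "kube-controller-manager", "<sa-token>")
    print(req.get_method(), req.full_url)
```

With the current localhost-only binding, a request like this from another host would never reach the component, which is the gap this issue aims to close.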

Details

There are three types of system/management components we would like to serve metrics from:

  1. K8s system components, such as kube-controller-manager, kube-scheduler, kube-proxy.
  2. EKS-A management components, such as eksa-cluster-controller, eks-anywhere-packages.
  3. CAPI components, such as capi-controller, capi-kubeadm-control-plane, capv-controller (provider specific), etcdadm-controller, and etcdadm-bootstrap-provider.

In the list above, scraping metrics on the secure port of the K8s system components is already supported by default in Kubernetes through the native K8s authentication and authorization workflow: https://github.com/kubernetes/kubernetes/pull/72491. So the controller-manager / scheduler secure metrics should already be enabled by default with the --authentication-kubeconfig and --authorization-kubeconfig flags. How they can emit metrics with RBAC needs more investigation, specifically whether all of the core components above can expose the /metrics endpoint behind authentication (user/group/SA) and authorization (RBAC with verb: get, nonResourceURLs: /metrics).
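
The RBAC rule mentioned above (verb: get, nonResourceURLs: /metrics) can be sketched as a ClusterRole bound to the scraper's ServiceAccount. The snippet below is a minimal illustration, not a proposed EKS-A default. The role name `metrics-reader`, the ServiceAccount `prometheus`, and the namespace `monitoring` are assumptions. It emits JSON, which kubectl accepts alongside YAML.

```python
import json


def metrics_reader_manifests(service_account: str, namespace: str) -> list[dict]:
    """Return a ClusterRole and ClusterRoleBinding granting GET on /metrics."""
    role_name = "metrics-reader"  # hypothetical name, not an EKS-A default
    cluster_role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": role_name},
        # Non-resource URL rule: allows GET on the /metrics path only.
        "rules": [{"nonResourceURLs": ["/metrics"], "verbs": ["get"]}],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": role_name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }
    return [cluster_role, binding]


if __name__ == "__main__":
    # Assumed scraper identity: ServiceAccount "prometheus" in "monitoring".
    items = metrics_reader_manifests("prometheus", "monitoring")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Whether this single rule is sufficient for every core component is exactly the open question raised above.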

As for CAPI components, all of them are built on controller-runtime, which in its v0.16.0 release added a secure metrics endpoint that uses HTTPS and provides authentication and authorization: https://github.com/kubernetes-sigs/controller-runtime/pull/2407. The CAPI community adopted this feature in its core controllers in the v1.6.0 release: https://github.com/kubernetes-sigs/cluster-api/pull/9264. Not all CAPI infrastructure providers have implemented the same feature yet, but we expect this to be the pattern to follow. The external etcd components are maintained by the EKS Anywhere team. We can follow the same pattern CAPI core used for secure diagnostics and implement it in etcdadm-controller-manager and etcdadm-bootstrap-provider.
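
For orientation, the controller-runtime filter authenticates scrapers through TokenReviews and authorizes them through SubjectAccessReviews against the kube-apiserver. The sketch below is not controller-runtime code. It only builds the two review request bodies and maps their results to the HTTP status a scraper would typically see. All function names are hypothetical, and the example user name in the usage note is an assumption.

```python
def token_review(token: str) -> dict:
    """Body sent to the TokenReview API to authenticate a scraper's bearer token."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "spec": {"token": token},
    }


def subject_access_review(user: str, groups: list[str]) -> dict:
    """Body asking whether the authenticated user may GET the /metrics path."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "user": user,
            "groups": groups,
            "nonResourceAttributes": {"path": "/metrics", "verb": "get"},
        },
    }


def scrape_status(token_review_status: dict, sar_status: dict) -> int:
    """Map the two review results onto an HTTP status code."""
    if not token_review_status.get("authenticated", False):
        return 401  # token could not be authenticated
    if not sar_status.get("allowed", False):
        return 403  # authenticated, but RBAC does not allow GET /metrics
    return 200
```

For example, a token that authenticates as `system:serviceaccount:monitoring:prometheus` but has no rule allowing get on /metrics would pass the first check and fail the second.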

The EKS-A management components are also built on controller-runtime. We can follow the same pattern the CAPI community used for secure diagnostics; this requires further changes in the EKS-A cluster-controller-manager and eks-anywhere-packages.

After figuring out how each type of component can serve its metrics endpoint securely, we can decide how to make this configurable through EKS-A in a simple and secure way, whether through the EKS-A cluster spec or through a documented recommendation using RBAC and a ClusterRole.

Planning

We want to prioritize the work of exposing K8s system components first based on request:

As explained above, the metrics authentication and authorization flows differ between the native K8s components and the rest, which are built on top of controller-runtime. Thus we would like to implement the feature in phases:

  1. A design doc covering a solution for all the system and management components. It needs to be generic enough to onboard, or be compatible with, K8s / EKS-A / CAPI / etcd component metrics.
  2. Implementation of exposing K8s system components based on the design.
  3. Introducing secure diagnostics in EKS-A management components, using controller-runtime authorization for the metrics endpoint.
  4. Introducing secure diagnostics in external etcd components, using controller-runtime authorization for the metrics endpoint.
  5. Pushing or contributing to CAPI to enable secure diagnostics features for all EKS-A supported CAPI providers.
  6. Implementation of exposing EKS-A and CAPI components metrics through cluster spec.

sp1999 commented 7 months ago

Design doc - https://quip-amazon.com/bVBOAiJG6969/Expose-metrics-for-all-EKS-A-components-securely