Azure-Samples / virtual-node-autoscale

A sample application to demonstrate Autoscale with AKS Virtual Nodes
MIT License

Prometheus adapter keeps restarting #34

Open jonielsen opened 5 years ago

jonielsen commented 5 years ago

```
prometheus-adaptor-prometheus-adapter-5cddf7cc64-7st6p   0/1   Running   8   17m
```

```
I1210 22:09:59.011415 1 adapter.go:86] successfully using in-cluster auth
I1210 22:09:59.091040 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api?timeout=32s 200 OK in 78 milliseconds
I1210 22:09:59.093227 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.096178 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api/v1?timeout=32s 200 OK in 2 milliseconds
I1210 22:09:59.098313 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apiregistration.k8s.io/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.103816 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apiregistration.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.106082 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/extensions/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.107925 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apps/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.109920 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apps/v1beta2?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.111854 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apps/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.114300 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/events.k8s.io/v1beta1?timeout=32s 200 OK in 2 milliseconds
I1210 22:09:59.116655 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/authentication.k8s.io/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.118685 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/authentication.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.122686 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/authorization.k8s.io/v1?timeout=32s 200 OK in 3 milliseconds
I1210 22:09:59.124517 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/authorization.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.126343 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/autoscaling/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.128408 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/autoscaling/v2beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.130512 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/batch/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.132255 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/batch/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.134322 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/certificates.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.136370 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/networking.k8s.io/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.138662 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/policy/v1beta1?timeout=32s 200 OK in 2 milliseconds
I1210 22:09:59.147633 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/rbac.authorization.k8s.io/v1?timeout=32s 200 OK in 8 milliseconds
I1210 22:09:59.150959 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/rbac.authorization.k8s.io/v1beta1?timeout=32s 200 OK in 3 milliseconds
I1210 22:09:59.152894 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/storage.k8s.io/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.154756 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/storage.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.156605 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/admissionregistration.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.159388 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/admissionregistration.k8s.io/v1alpha1?timeout=32s 200 OK in 2 milliseconds
I1210 22:09:59.161336 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.163542 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/scheduling.k8s.io/v1beta1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.165642 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/monitoring.coreos.com/v1?timeout=32s 200 OK in 1 milliseconds
I1210 22:09:59.167575 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/custom.metrics.k8s.io/v1beta1?timeout=32s 503 Service Unavailable in 1 milliseconds
I1210 22:09:59.167833 1 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
I1210 22:09:59.180671 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/apis/metrics.k8s.io/v1beta1?timeout=32s 200 OK in 12 milliseconds
I1210 22:09:59.194052 1 api.go:74] GET http://prometheus-prometheus-0.default.svc.cluster.local:9090/api/v1/series?match%5B%5D=request_durations_histogram_secs_count%7Bnamespace%21%3D%22%22%2C+pod%21%3D%22%22%7D&start=1544479769.182 200 OK
I1210 22:10:00.108908 1 serving.go:273] Generated self-signed cert (/tmp/cert/apiserver.crt, /tmp/cert/apiserver.key)
I1210 22:10:00.747100 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication 200 OK in 20 milliseconds
I1210 22:10:00.758810 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication 200 OK in 10 milliseconds
I1210 22:10:00.763781 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication 200 OK in 4 milliseconds
I1210 22:10:00.768479 1 round_trippers.go:405] GET https://vnode1-c69a13ec.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication 200 OK in 4 milliseconds
I1210 22:10:00.771040 1 healthz.go:83] Installing healthz checkers:"ping"
I1210 22:10:00.771193 1 serve.go:96] Serving securely on [::]:6443
```

```
bash-3.2$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pod/*/requests_per_second | jq .
Error from server (ServiceUnavailable): the server is currently unable to handle the request
```
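A 503 from `/apis/custom.metrics.k8s.io/v1beta1` usually means the aggregated APIService backing the adapter is unhealthy (which matches the adapter pod never becoming Ready). A quick way to confirm, assuming the default APIService name registered by the prometheus-adapter chart:

```shell
# Check whether the custom metrics APIService is registered and Available.
# v1beta1.custom.metrics.k8s.io is the chart's default name; adjust if yours differs.
kubectl get apiservice v1beta1.custom.metrics.k8s.io

# Inspect the status conditions for the failure reason
# (e.g. FailedDiscoveryCheck while the backing pod is not Ready).
kubectl describe apiservice v1beta1.custom.metrics.k8s.io
```

If the APIService shows `Available: False`, the `kubectl get --raw` query above will keep returning ServiceUnavailable until the adapter pod passes its probes.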

cradle77 commented 5 years ago

I have the same problem; apparently the pod's readiness probe is returning a 403:


```
  Type     Reason     Age                  From                               Message
  ----     ------     ----                 ----                               -------
  Normal   Scheduled  2m21s                default-scheduler                  Successfully assigned default/prometheus-adaptor-prometheus-adapter-5cddf7cc64-vlvls to aks-nodepool1-17033719-0
  Warning  Unhealthy  32s (x4 over 102s)   kubelet, aks-nodepool1-17033719-0  Readiness probe failed: HTTP probe failed with statuscode: 403
  Normal   Pulled     22s (x3 over 2m20s)  kubelet, aks-nodepool1-17033719-0  Container image "directxman12/k8s-prometheus-adapter-amd64:v0.4.0" already present on machine
  Normal   Created    22s (x3 over 2m20s)  kubelet, aks-nodepool1-17033719-0  Created container
  Warning  Unhealthy  22s (x6 over 102s)   kubelet, aks-nodepool1-17033719-0  Liveness probe failed: HTTP probe failed with statuscode: 403
  Normal   Killing    22s (x2 over 82s)    kubelet, aks-nodepool1-17033719-0  Killing container with id docker://prometheus-adapter:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Started    21s (x3 over 2m20s)  kubelet, aks-nodepool1-17033719-0  Started container
```
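For anyone reproducing this, an event table like the one above comes from describing the pod (the pod name here is from this report and will differ in your cluster):

```shell
# Dump the full pod description, including the Events section at the bottom.
kubectl describe pod prometheus-adaptor-prometheus-adapter-5cddf7cc64-vlvls

# Or watch probe failures across the whole namespace, newest last.
kubectl get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp
```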
ams0 commented 5 years ago

I was pointed to this (insecure) fix: https://github.com/helm/charts/issues/10222. It works, but watch out for potential security holes.

lachie83 commented 5 years ago

I believe that this issue is actually due to the adapter getting scheduled to the virtual node, which doesn't yet support readiness/liveness probes.
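If that's the cause, it should be visible from where the adapter pod landed. A quick check (the label selector and node name `virtual-node-aci-linux` are the usual defaults for this chart and AKS virtual nodes, but may differ in your setup):

```shell
# Show which node the adapter pods were scheduled on; a virtual-kubelet
# node (e.g. virtual-node-aci-linux) would explain the failing probes.
kubectl get pods -l app=prometheus-adapter -o wide

# List nodes so the virtual node is easy to spot.
kubectl get nodes -o wide
```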

lachie83 commented 5 years ago

I believe the flow to be as follows:

lachie83 commented 5 years ago

cc @jeremyrickard

lachie83 commented 5 years ago

/assign @lachie83

lachie83 commented 5 years ago

Here are the three possible ideas I have to solve this at the moment:

  1. Split the monitoring stack from the application stack and make the vn-affinity admission controller patch only the pods in the namespace where the online-store is deployed. The current setup doesn't allow the online-store app and the Prometheus monitoring stack to live in separate namespaces.
  2. Update the vn-affinity admission controller to match a user-defined label so that it only patches a subset of pods that match thus not touching the monitoring stack pods.
  3. Remove the vn-affinity admission controller and add the toleration and node affinity changes directly to the online-store deployment.
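As a rough sketch of option 3 (all names are illustrative, not taken from this repo: the deployment name, taint keys, and the `type=virtual-kubelet` node label should be checked against your cluster), the toleration and node affinity the webhook would have injected can be applied directly to the online-store deployment:

```shell
# Option 3 sketch: drop the vn-affinity webhook and patch the online-store
# deployment itself so only this workload tolerates and prefers the virtual node.
kubectl patch deployment online-store --patch "$(cat <<'EOF'
spec:
  template:
    spec:
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: type
                operator: In
                values:
                - virtual-kubelet
EOF
)"
```

With this in place the monitoring stack pods are never touched, at the cost of hard-coding the virtual-node scheduling details into the application manifest.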
lachie83 commented 5 years ago

cc @rbitia