Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Common Horizontal Pod Autoscaler events in AKS cluster #1137

Closed brobichaud closed 4 years ago

brobichaud commented 5 years ago

What happened: I am frequently seeing warning events regarding metrics and the horizontal pod autoscaler as seen below:

LAST SEEN   TYPE      REASON                         OBJECT                                          MESSAGE
109s        Warning   FailedGetResourceMetric        horizontalpodautoscaler/dev-burns-portal-hpa    unable to get metrics for resource cpu: no metrics returned from resource metrics API
26m         Warning   FailedComputeMetricsReplicas   horizontalpodautoscaler/dev-burns-portal-hpa    failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API

It's unclear to me whether this is truly an AKS setup issue or is really a k8s or hpa issue, so I'm starting here. I didn't see these when I had a raw AKS-Engine based cluster not long ago.

What you expected to happen: For the HPA to successfully acquire metrics on every call.

How to reproduce it (as minimally and precisely as possible): Set up a simple AKS cluster with a Windows nodepool, deploy some pods, and set up simple CPU-based autoscale rules, such as:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: dev-burns-portal-hpa
spec:
  minReplicas: 2
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        averageUtilization: 80
        type: Utilization
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dev-burns-portal

Monitor system events and you should see events similar to those referenced above.
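For a quicker reproduction, an equivalent HPA can be created imperatively with kubectl autoscale (a sketch; it assumes the dev-burns-portal deployment from the manifest above already exists):

```shell
# Create a CPU-based HPA equivalent to the manifest above.
kubectl autoscale deployment dev-burns-portal \
  --cpu-percent=80 --min=2 --max=4

# Watch the HPA status; TARGETS shows <unknown> when metrics are unavailable.
kubectl get hpa dev-burns-portal --watch

# List events for HPA objects to catch FailedGetResourceMetric warnings.
kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler
```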

Anything else we need to know?: I was not seeing these with a raw AKS-Engine based cluster of similar configuration.

Environment:

mikkelhegn commented 5 years ago

Could this be caused by this issue with slow response from the api? https://github.com/kubernetes/kubernetes/issues/75752

I've seen metrics become unavailable through kubectl top nodes, with e.g. 10 containers on a Windows node.

brobichaud commented 5 years ago

Hmmm, that issue is definitely suspicious looking, but it is hard for me to say if that would actually cause this. I do see "kubectl top nodes" return "unknown" on my Windows nodepool currently.

cpunella commented 5 years ago

Hi, have you set resource limits in the deployment YAML definition?

brobichaud commented 5 years ago

@cpunella Yes, I do have both CPU and RAM limits in all of my deployments. Does that somehow affect the HPA or access to metrics data?
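For reference, a deployment feeding this HPA would carry resource settings like the following (a minimal sketch; the container name, image, and values are illustrative). Note that the HPA computes CPU utilization as a percentage of the pod's resource requests, so requests in particular must be set:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-burns-portal
spec:
  replicas: 2
  selector:
    matchLabels:
      app: dev-burns-portal
  template:
    metadata:
      labels:
        app: dev-burns-portal
    spec:
      containers:
      - name: portal                    # illustrative container name
        image: example.azurecr.io/dev-burns-portal:latest  # illustrative image
        resources:
          requests:                     # HPA utilization is measured against requests
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
```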

zhiweiv commented 5 years ago

It should be the metrics API issue. You can check the metrics-server pod logs with: kubectl logs -n kube-system --tail=100 metrics-server-66dbbb67db-vtzq9

You will find something like: Failed to get kubelet_summary:10.10.0.252:10255 response in time
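A sketch of that check without hard-coding the pod name, since the hash suffix differs per cluster (assuming the standard k8s-app=metrics-server label):

```shell
# Locate the metrics-server pod in kube-system.
kubectl get pods -n kube-system -l k8s-app=metrics-server

# Tail its logs and filter for kubelet summary timeouts and missing pod metrics.
kubectl logs -n kube-system -l k8s-app=metrics-server --tail=100 \
  | grep -i "kubelet_summary\|No metrics for pod"
```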

brobichaud commented 5 years ago

Indeed @zhiweiv I am seeing events like: Failed to get kubelet_summary...

mixed in with a ton of: No metrics for pod default/dev

Thanks for the steps to confirm.

cpunella commented 5 years ago

@brobichaud yes, I found that if you don't set limits in the deployment, metrics are not collected ...

brobichaud commented 5 years ago

> @brobichaud yes, I found that if you don't set limits in the deployment, metrics are not collected ...

But I do have limits on all of my deployments, yet metrics seem hit or miss...

partha-sarathi-sarkar commented 4 years ago

Hi @brobichaud Can you please try this manifest and let me know whether it works:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: dev-burns-portal-hpa
spec:
  minReplicas: 2
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dev-burns-portal
  targetCPUUtilizationPercentage: 80

And make sure that in your deployment file you have allocated resources for your pods.

brobichaud commented 4 years ago

Does anyone know if it is accurate to say that this should be addressed by this change: https://github.com/kubernetes/kubernetes/issues/74991

mikkelhegn commented 4 years ago

That's my understanding - @marosset to confirm.

marosset commented 4 years ago

I think so. Also, the fix is still only available in 1.18+ (my PRs to backport the fix to 1.15-1.17 are still in limbo)

brobichaud commented 4 years ago

> I think so. Also, the fix is still only available in 1.18+ (my PRs to backport the fix to 1.15-1.17 are still in limbo)

If you do end up getting this into earlier k8s releases could you update this issue?

mikkelhegn commented 4 years ago

This got fixed in https://github.com/kubernetes/kubernetes/pull/87730, but that fix introduced another issue, addressed by https://github.com/kubernetes/kubernetes/pull/90554, which is now also fixed and deployed in AKS from 1.16.9+, 1.17.5+ and 1.18.1+.
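To confirm a cluster has picked up the fix, the control-plane and node versions can be checked against those minimums (a sketch; the resource-group and cluster names are placeholders):

```shell
# Control-plane Kubernetes version (needs 1.16.9+, 1.17.5+ or 1.18.1+).
az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query kubernetesVersion -o tsv

# Per-node kubelet versions, including the Windows nodepool.
kubectl get nodes -o wide
```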