HPA memory - GitHub issues

Please study this article: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ In Kubernetes, Horizontal Pod Autoscaling (HPA) automatically adjusts the number of Pod replicas of a workload to match demand. The formula is:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

where ceil rounds up to the nearest integer.

Assume the current situation is as follows:

• currentReplicas = 2

• currentMetricValue = 90

• desiredMetricValue = 80

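Substituting into the formula:

desiredReplicas = ceil[2 * (90 / 80)] = ceil[2 * 1.125] = ceil[2.25] = 3

Also, according to the documentation, the control plane skips any scaling action when the ratio is sufficiently close to 1.0 (within a globally configurable tolerance, 0.1 by default). In this example the ratio is:

ratio = currentMetricValue / desiredMetricValue = 90 / 80 = 1.125

The difference between 1.125 and 1.0 is 0.125, which exceeds the default tolerance of 0.1. The control plane will therefore most likely scale up from 2 replicas to 3. First, please verify whether this reasoning is correct. Second, please give me more examples.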

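Your reasoning is correct. The HPA control plane decides whether to change the replica count from the ratio between the current metric value (currentMetricValue) and the desired metric value (desiredMetricValue). In this example the ratio is 1.125, which differs from 1.0 by 0.125. That is more than the default tolerance of 0.1, so the HPA triggers a scale-up to 3 replicas (as long as maxReplicas allows it).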

Here are some additional examples showing how desiredReplicas is computed for different values of currentMetricValue and desiredMetricValue:

Example 1:
- currentReplicas = 3
- currentMetricValue = 120
- desiredMetricValue = 100
- ratio = 120 / 100 = 1.2
- desiredReplicas = ceil[3 * (1.2)] = ceil[3.6] = 4
Example 2:
- currentReplicas = 4
- currentMetricValue = 100
- desiredMetricValue = 100
- ratio = 100 / 100 = 1.0
- desiredReplicas = ceil[4 * (1.0)] = ceil[4] = 4 (the ratio is exactly 1.0, so no scaling is triggered)
Example 3:
- currentReplicas = 2
- currentMetricValue = 160
- desiredMetricValue = 100
- ratio = 160 / 100 = 1.6
- desiredReplicas = ceil[2 * (1.6)] = ceil[3.2] = 4
Example 4:
- currentReplicas = 5
- currentMetricValue = 50
- desiredMetricValue = 100
- ratio = 50 / 100 = 0.5
- desiredReplicas = ceil[5 * (0.5)] = ceil[2.5] = 3 (the HPA scales down to 3 replicas)
Example 5:
- currentReplicas = 10
- currentMetricValue = 90
- desiredMetricValue = 100
- ratio = 90 / 100 = 0.9
- desiredReplicas = ceil[10 * (0.9)] = ceil[9] = 9 by the formula alone, but the ratio differs from 1.0 by only 0.1, which does not exceed the default tolerance, so the HPA would be expected to skip scaling and stay at 10 replicas

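These examples show how different ratios affect the replica count. If the ratio is close to 1.0 and the difference is within the tolerance, the HPA skips scaling. If the ratio is significantly above or below 1.0, the HPA increases or decreases the replica count accordingly.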
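The formula and the tolerance check can be tried out with a short script. The following is a minimal Python sketch, not the controller's actual code: the function name `desired_replicas` and its parameters are made up for illustration, and it ignores minReplicas/maxReplicas, unready Pods, and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_metric, desired_metric, tolerance=0.1):
    """Apply the HPA formula, skipping scaling when the ratio is within tolerance.

    Hypothetical helper for illustration only; the default tolerance of 0.1
    matches the value quoted from the Kubernetes documentation.
    """
    ratio = current_metric / desired_metric
    # A ratio close enough to 1.0 means "leave the replica count alone".
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    cases = [
        (2, 90, 80),     # the original question
        (3, 120, 100),   # example 1
        (4, 100, 100),   # example 2
        (2, 160, 100),   # example 3
        (5, 50, 100),    # example 4
        (10, 90, 100),   # example 5
    ]
    for replicas, current, desired in cases:
        print(replicas, current, desired, "->", desired_replicas(replicas, current, desired))
```

For the question's numbers and the five examples this yields 3, 4, 4, 4, 3, and 10 replicas respectively.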

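A FailedGetResourceMetric error usually means that the HPA cannot obtain the resource metrics it needs (such as CPU or memory usage). Possible causes include Metrics Server problems, API permission problems, and network problems. The steps below help diagnose and fix it:

Troubleshooting steps

Check whether Metrics Server is running. Metrics Server supplies the metrics that the HPA needs. Start by checking its status: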
```
kubectl get deployment metrics-server -n kube-system
kubectl get pods -n kube-system | grep metrics-server
```
Make sure the Metrics Server Pod is in the Running state.

Check whether the Metrics API is working. Run the following commands and confirm that the Metrics API returns data:
```
kubectl top nodes
kubectl top pods -n <your-namespace>
```
If these commands return no data, something is wrong with Metrics Server.

Check the HPA configuration. Make sure the HPA is configured correctly, especially the metrics section. For example:

```
apiVersion: autoscaling/v2  # autoscaling/v2beta2 is deprecated and was removed in Kubernetes 1.26
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```

Check the HPA events. Look at the HPA's events to find the specific error message:
```
kubectl describe hpa <hpa-name> -n <namespace>
```
Look for the specific cause reported with the FailedGetResourceMetric error.

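Check for permission problems. Make sure the HPA has sufficient permissions to reach the metrics API and that Metrics Server itself can read the objects it needs. Check the RBAC configuration, for example the ClusterRole used by Metrics Server:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:metrics-server
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "list", "watch"]
```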

Fix steps

Redeploy Metrics Server. If Metrics Server is faulty, you can try redeploying it by deleting and re-applying its YAML manifest, for example with the following commands:

```
kubectl delete -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

Review and fix the HPA configuration. Make sure the HPA configuration is correct and matches your needs. To modify it, use the following command:
```
kubectl edit hpa <hpa-name> -n <namespace>
```

Adjust the RBAC configuration. If permissions are insufficient, adjust RBAC so that metrics can be collected and served to the HPA. You can create or update the relevant ClusterRole and ClusterRoleBinding objects, for example:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics-server:system:auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system
```

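Check the network and firewall. Make sure no network problem is blocking communication between the HPA and Metrics Server. Check the cluster's network configuration and firewall rules.

Example troubleshooting commands

```
# Check Metrics Server status
kubectl get deployment metrics-server -n kube-system
kubectl get pods -n kube-system | grep metrics-server

# Check whether the Metrics API is working
kubectl top nodes
kubectl top pods -n <your-namespace>

# View HPA events
kubectl describe hpa <hpa-name> -n <namespace>

# Edit the HPA configuration
kubectl edit hpa <hpa-name> -n <namespace>
```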

These steps should help you locate and resolve the FailedGetResourceMetric problem, so that the HPA can fetch resource metrics and scale accordingly.

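If kubectl describe hpa shows both the current and desired replica counts as 3 while the formula says the desired value should be 2, there are several possible causes. The following checks help narrow it down:

Troubleshooting steps

Check the HPA configuration. Make sure the HPA is configured correctly and review its scaling rules. For example: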
```
kubectl describe hpa <hpa-name> -n <namespace>
```
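Check that minReplicas and maxReplicas are set correctly and that the target utilization is reasonable. The desired count never goes below minReplicas.

Check resource utilization. Make sure the actual utilization is what you expect and that you applied the formula correctly. For example, if the target CPU utilization is 50% and the actual utilization is 100%, the HPA scales up (up to maxReplicas) rather than down. Note that HPA utilization is measured relative to the containers' resource requests, whereas kubectl top shows absolute usage.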
```
kubectl top pod -n <namespace> | grep <deployment-name>
```
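To compare the numbers from kubectl top with the HPA's decision, it can help to redo the arithmetic by hand. The sketch below is a simplified Python illustration under stated assumptions: utilization is taken as total usage divided by total requests, the helper names and the sample numbers are hypothetical, and the result is clamped to minReplicas/maxReplicas. It is not the controller's implementation.

```python
import math


def utilization_percent(usage_millicores, request_millicores):
    """Utilization as total usage over total requests, in percent (simplified)."""
    return 100.0 * sum(usage_millicores) / sum(request_millicores)


def clamped_desired_replicas(current_replicas, utilization, target,
                             min_replicas, max_replicas, tolerance=0.1):
    """HPA formula with tolerance, then clamped to [min_replicas, max_replicas]."""
    ratio = utilization / target
    if abs(1.0 - ratio) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical sample: three Pods, each using 100m CPU with a 500m request.
    usage = [100, 100, 100]
    requests = [500, 500, 500]
    util = utilization_percent(usage, requests)
    print("utilization:", util)
    print("minReplicas=1:", clamped_desired_replicas(3, util, 50, 1, 10))
    print("minReplicas=3:", clamped_desired_replicas(3, util, 50, 3, 10))
```

With these sample numbers the formula alone gives 2 replicas, but with minReplicas set to 3 the result stays at 3, which is one way current and desired can both read 3.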
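View the HPA events. Look at the HPA's events for a specific error message. For example, the HPA may have hit a transient calculation error, or Metrics Server may be returning inaccurate data.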
```
kubectl describe hpa <hpa-name> -n <namespace>
```
Check Metrics Server status. Make sure Metrics Server is running and serving accurate data. If it is not, the HPA may be unable to compute the desired replica count correctly.
```
kubectl get deployment metrics-server -n kube-system
kubectl get pods -n kube-system | grep metrics-server
```
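Check the HPA stabilization window (cooldown). After load drops, the HPA does not scale down immediately: within the scale-down stabilization window (behavior.scaleDown.stabilizationWindowSeconds, 300 seconds by default) it keeps acting on the highest recent recommendation. Check this setting to see whether it explains why the replica count has not dropped yet.

Fix steps

Redeploy Metrics Server. If Metrics Server is faulty, you can try redeploying it by deleting and re-applying its YAML manifest, for example with the following commands:

```
kubectl delete -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```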

Review and fix the HPA configuration. Make sure the HPA configuration is correct and matches your needs. To modify it, use the following command:
```
kubectl edit hpa <hpa-name> -n <namespace>
```
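Check and adjust the stabilization window. If the window is longer than you need, you can shorten it so that the HPA reacts faster to changes in resource utilization. For example: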
```
spec:
 behavior:
   scaleDown:
     stabilizationWindowSeconds: 60  # default is 300 seconds
```
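To see why a shorter window lets the replica count drop sooner, here is a minimal Python sketch of scale-down stabilization. It assumes the documented behavior that the HPA acts on the highest recommendation seen within the window; the function name and the (timestamp, replicas) sample data are hypothetical.

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """Return the highest recommendation made within the stabilization window.

    recommendations: list of (timestamp_seconds, desired_replicas) tuples,
    including the most recent one. Illustrative only.
    """
    recent = [replicas for ts, replicas in recommendations
              if now - ts <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


if __name__ == "__main__":
    # The formula said 3 at t=0s, then 2 at t=120s and t=240s.
    history = [(0, 3), (120, 2), (240, 2)]
    print(stabilized_scale_down(history, now=240, window_seconds=300))
    print(stabilized_scale_down(history, now=240, window_seconds=60))
```

With the default 300-second window the older recommendation of 3 still counts, so the desired count stays at 3. With a 60-second window only the latest recommendation remains and the result is 2.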

Example troubleshooting commands

```
# View HPA details
kubectl describe hpa <hpa-name> -n <namespace>

# View resource usage of the Pods
kubectl top pod -n <namespace> | grep <deployment-name>

# Check Metrics Server status
kubectl get deployment metrics-server -n kube-system
kubectl get pods -n kube-system | grep metrics-server

# Edit the HPA configuration
kubectl edit hpa <hpa-name> -n <namespace>
```

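These steps should help you locate and resolve cases where the HPA does not scale down the number of Pods as expected, so that it adjusts the Pod count according to resource utilization.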

aibangjuxin / groovy

Hpa memory #155
