Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] The memory reservation setting after aks 1.29 is unreasonable #4524

Open deny2018 opened 1 month ago

deny2018 commented 1 month ago

We previously ran AKS 1.28.5 and upgraded to 1.30. Since the upgrade, node status has repeatedly changed to Unknown and NotReady. This never happened before. We checked the official AKS documentation and found that the memory reservation was adjusted starting with AKS 1.29. So far the problem has only occurred on clusters upgraded from 1.28 to 1.30; clusters that were not upgraded are working normally.

The allocatable resources reported on a 1.30 node are as follows:

Allocatable:
  cpu: 15740m
  ephemeral-storage: 479347519924
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 59329368Ki
  pods: 60

On 1.28 they are as follows:

Allocatable:
  cpu: 15740m
  ephemeral-storage: 119703055367
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 64460556Ki
  pods: 60
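To put the two Allocatable memory values quoted above side by side, here is a minimal Python sketch that computes the gap between them. The helper name `parse_ki` and the `ALLOCATABLE_MEMORY` table are illustrative and not part of any AKS or Kubernetes tooling. The sketch only handles the `Ki` suffix that appears in this report and says nothing about why the values differ.

```python
# Compare the allocatable memory reported in this issue for 1.28 and 1.30.
# The figures are copied from the node output quoted above.

ALLOCATABLE_MEMORY = {
    "1.28": "64460556Ki",
    "1.30": "59329368Ki",
}


def parse_ki(quantity: str) -> int:
    """Return the number of KiB in a quantity string such as '59329368Ki'.

    Hypothetical helper: only the 'Ki' suffix is supported, because that is
    the only unit used in the report.
    """
    if not quantity.endswith("Ki"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


before_kib = parse_ki(ALLOCATABLE_MEMORY["1.28"])
after_kib = parse_ki(ALLOCATABLE_MEMORY["1.30"])

# A positive delta means the 1.30 node reports less allocatable memory.
delta_kib = before_kib - after_kib
delta_gib = delta_kib / (1024 * 1024)

print(f"1.28 allocatable: {before_kib} Ki")
print(f"1.30 allocatable: {after_kib} Ki")
print(f"difference:       {delta_kib} Ki (~{delta_gib:.2f} GiB)")
```

With the values from this report, the 1.30 node shows 5131188 Ki (about 4.89 GiB) less allocatable memory than the 1.28 node.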

We suspect a problem in the current AKS reservation algorithm: the memory reserved for the node is insufficient, so the node status becomes Unknown and every pod on that node is forcibly evicted. We think this logic is flawed. On 1.28 the node itself never had a problem; even when memory was overused, only the pods using the most memory were evicted. Now the entire node becomes unusable, which is a serious problem.

microsoft-github-policy-service[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had any activity for 21 days. It will be closed if no further activity occurs within 7 days of this comment.