We were previously on version 1.28.5 and upgraded to 1.30. Since the upgrade, nodes have repeatedly flipped to Unknown/NotReady status; this never happened before. We also checked the official AKS documentation and found that memory reservations were adjusted starting with AKS 1.29. The problem has only appeared on clusters upgraded from 1.28 to 1.30; clusters that were not upgraded behave normally. The reservation information we observed:
On 1.30:
```
Allocatable:
  cpu:                15740m
  ephemeral-storage:  479347519924
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             59329368Ki
  pods:               60
```
On 1.28:
```
Allocatable:
  cpu:                15740m
  ephemeral-storage:  119703055367
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             64460556Ki
  pods:               60
```
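For reference, the gap between the two allocatable figures works out to roughly 4.9 GiB, about 8% of the 1.28 allocatable value; a quick calculation:

```python
# Allocatable memory reported by the two node versions (values from above).
alloc_128_ki = 64_460_556  # Ki on 1.28
alloc_130_ki = 59_329_368  # Ki on 1.30

delta_ki = alloc_128_ki - alloc_130_ki        # 5131188 Ki
delta_gib = delta_ki / (1024 * 1024)          # Ki -> GiB, ~4.89 GiB
pct = 100 * delta_ki / alloc_128_ki           # ~8.0% of the 1.28 allocatable

print(f"extra reservation on 1.30: {delta_ki} Ki "
      f"(~{delta_gib:.2f} GiB, {pct:.1f}% of 1.28 allocatable)")
```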
We suspect there is a problem with the current AKS reservation algorithm: it leaves insufficient memory reserved for the node itself, the node status becomes Unknown, and every pod on the node is forcibly evicted. This logic seems wrong. On 1.28 the node itself was never affected; even when memory usage exceeded the threshold, only the pods consuming the most memory were evicted. Now the entire node becomes unusable, which is a serious problem.
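For context, the 1.29 change described in the AKS resource-reservation docs, as I read them (the exact brackets and constants below are my understanding, not verified against the node above), moved kube-reserved memory from a regressive percentage of total memory to the lesser of 20 MB per max pod + 50 MB or 25% of total memory, and lowered the eviction threshold from 750 Mi to 100 Mi. A rough sketch of both formulas:

```python
def reserved_pre_129(total_gb: float) -> float:
    """Pre-1.29 regressive kube-reserved memory (GB), per my reading of the AKS docs."""
    # (bracket size in GB, reservation rate)
    brackets = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (float("inf"), 0.02)]
    reserved, remaining = 0.0, total_gb
    for size, rate in brackets:
        chunk = min(remaining, size)
        reserved += chunk * rate
        remaining -= chunk
        if remaining <= 0:
            break
    return reserved

def reserved_129_plus(total_gb: float, max_pods: int) -> float:
    """1.29+ kube-reserved memory (GB): min(20 MB * max pods + 50 MB, 25% of total)."""
    return min((20 * max_pods + 50) / 1000, 0.25 * total_gb)

# For a hypothetical 64 GB node with maxPods=60:
old = reserved_pre_129(64)        # ~5.48 GB (1.0 + 0.8 + 0.8 + 2.88)
new = reserved_129_plus(64, 60)   # 1.25 GB
```

Note that by these formulas 1.29+ should reserve *less* memory than before, not more, so the ~5 GiB drop in allocatable observed above may come from something else entirely (different VM SKU or OS image between the two nodes, for instance); worth confirming per node with `kubectl describe node`.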