sidoruka opened 4 years ago
@sidoruka As I found out during the investigation, a pod can stay alive during resource starvation if we disable the eviction strategy. I tested 3 different scenarios (resource starvation was modeled by an artificial stress test):
So I believe the first and second scenarios are what we are looking for, and in that case our pipeline can return to a normal state after a short resource starvation.
In addition, I want to add some details related to this issue. It turned out that we can change a setting of the kube-controller-manager system pod to increase the period of time before a pipeline pod is terminated in case (3). According to the documentation (https://kubernetes.io/docs/concepts/architecture/nodes/#node-condition):
If the Status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout
(an argument passed to the kube-controller-manager), all the Pods on the node are scheduled for
deletion by the node controller. The default eviction timeout duration is five minutes. In some cases
when the node is unreachable, the apiserver is unable to communicate with the kubelet on the node.
The decision to delete the pods cannot be communicated to the kubelet until communication with the
apiserver is re-established. In the meantime, the pods that are scheduled for deletion may continue
to run on the partitioned node.
In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods
from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until
it is confirmed that they have stopped running in the cluster. You can see the pods that might be running
on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce
from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may
need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects
running on the node to be deleted from the apiserver, and frees up their names.
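As a concrete illustration of the last point in the quote, the manual cleanup could look like this (a sketch; `<node-name>` is a placeholder, not a value from this issue):

```shell
# List nodes; an unreachable node shows STATUS NotReady (Ready condition Unknown).
kubectl get nodes

# Pods on the partitioned node may be stuck in Terminating/Unknown:
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# If the node has permanently left the cluster, delete the node object by hand.
# This removes the Pod objects scheduled on it from the apiserver and frees their names.
kubectl delete node <node-name>
```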
So the pod-eviction-timeout parameter (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) can be changed to an appropriate value; this way we can postpone the termination of a pod by the node controller.
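For example, on a kubeadm-style cluster where kube-controller-manager runs as a static pod, raising the timeout could look like this (a sketch; the manifest path and the 30m value are assumptions, not recommendations from this issue):

```shell
# Edit the kube-controller-manager static pod manifest on the control-plane node;
# the kubelet restarts the pod automatically when the file changes:
#   /etc/kubernetes/manifests/kube-controller-manager.yaml
#
# Add or adjust the flag, e.g. raise the eviction timeout from the default 5m:
#   - --pod-eviction-timeout=30m
#
# Verify the running process picked up the new value:
kubectl -n kube-system get pod -l component=kube-controller-manager \
  -o jsonpath='{.items[0].spec.containers[0].command}' | tr ',' '\n' | grep pod-eviction-timeout
```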
Of course, there is another situation where the OOM killer can kill a Docker container while a node is under pressure, but in that case we can't do anything.
In conclusion, here is a list of changes that we need to make to the kube configuration:
- set pod-eviction-timeout to an appropriate value
- pass --eviction-hard= as a kubelet parameter in order to disable the eviction policy for that node

After these actions we can introduce some approach to postpone killing lost nodes by AutoscalerManager.
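A minimal sketch of both changes (the flag value, file path, and drop-in mechanism are assumptions and depend on how the cluster was provisioned):

```shell
# 1) kube-controller-manager: postpone termination of pods on unreachable nodes
#    (flag added to its static pod manifest or systemd unit):
#      --pod-eviction-timeout=30m
#
# 2) kubelet on the affected node: disable the hard eviction policy entirely
#    by passing an empty threshold list:
#      --eviction-hard=
#    e.g. via KUBELET_EXTRA_ARGS (path is distribution-specific):
echo 'KUBELET_EXTRA_ARGS=--eviction-hard=' | sudo tee /etc/sysconfig/kubelet
sudo systemctl restart kubelet
```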
Background
From time to time we run into cases when a compute node becomes unresponsive (the kubelet can't send heartbeats) due to high CPU/memory usage. If such a node is detected, the run is immediately stopped because the node's status is "Unknown".
While such oversubscription is not a good usage pattern, stopping the run may lead to loss of data/compute results. Thus, we shall try to handle such situations.
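The situation described above can be observed with standard kubectl commands, e.g. (`<node-name>` is a placeholder):

```shell
# An oversubscribed node stops reporting heartbeats and its Ready condition
# becomes Unknown (shown as NotReady in the STATUS column):
kubectl get nodes

# The Conditions block shows "NodeStatusUnknown / Kubelet stopped posting node status":
kubectl describe node <node-name> | grep -A 8 'Conditions:'
```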
Approach
Note: Before the implementation itself, please verify whether the node/pod can survive resource starvation (e.g., if the node's state is Unknown, is the pod deleted, or can it continue running once the node is restored?)
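One way to run this verification is to model the starvation with a stress tool on a disposable test node and watch what happens to a test pod (a sketch; the tool, its parameters, and the pod name are assumptions):

```shell
# On a disposable worker node, burn CPU and memory so the kubelet misses heartbeats
# (--cpu 0 uses all CPUs; run for 10 minutes, i.e. past the 5m eviction timeout):
stress-ng --cpu 0 --vm 2 --vm-bytes 90% --timeout 10m &

# From the control plane, watch the node flip to NotReady/Unknown...
kubectl get nodes -w

# ...and check whether the test pod is evicted or keeps running once the node recovers:
kubectl get pod <test-pod> -o wide -w
```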