epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

Autoscaler shall not kill the "Lost" nodes immediately #880

Open sidoruka opened 4 years ago

sidoruka commented 4 years ago

Background

From time to time we run into cases when a compute node gets unresponsive (the kubelet can't send its heartbeats) due to high CPU/memory usage. If such a node is detected - the run is immediately stopped, as the node's status is "Unknown".

While such oversubscription is not a good usage pattern - it may lead to the loss of data and compute results. Thus, we shall try to handle such situations.

Approach

  1. If a node gets into the "Unknown" state - we shall not kill the run right away
  2. We shall introduce a preference that controls the timeout period before stopping a run

Note: Before the implementation itself - please verify whether the node/pod can survive resource starvation (e.g. if the node's state is "Unknown" - is the pod deleted, or can it keep running once the node is restored?)
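
As an illustration of the intended behavior (not the actual implementation, which lives in the platform's API service), here is a minimal sketch of such a grace period, assuming the official kubernetes Python client and a hypothetical timeout preference passed in as `lost_node_timeout` seconds:

```python
# Sketch only: a grace period for "Unknown"/"NotReady" nodes. Assumes the
# official `kubernetes` Python client; `lost_node_timeout` stands for the
# hypothetical preference described above.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config


def find_lost_nodes(lost_node_timeout=600):
    """Return names of nodes whose Ready condition has not been True
    for longer than the configured grace period."""
    config.load_kube_config()  # or config.load_incluster_config()
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    lost = []
    for node in v1.list_node().items:
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is None or ready.status == "True":
            continue  # the node is healthy (or has not reported conditions yet)
        # last_transition_time marks when the node stopped being Ready
        if now - ready.last_transition_time > timedelta(seconds=lost_node_timeout):
            lost.append(node.metadata.name)
    return lost
```

Only the nodes returned by such a check would be treated as "lost" and have their runs stopped; nodes that became unresponsive recently get a chance to recover.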

SilinPavel commented 4 years ago

@sidoruka As I found out during the investigation, a pod can stay alive during resource starvation if we disable the eviction strategy. I observed 3 different scenarios (resource starvation was simulated with an artificial stress test):

  1. When a node starts to "starve", its status becomes "NotReady", but the pod stays in the "Running" state; after the resources are cleaned up, all statuses go back to normal.
  2. If the starvation is simulated a little longer (around 2 minutes in my case), the node status becomes "Unknown", but the pod still stays in the "Running" state; after the resources are cleaned up, all statuses go back to normal.
  3. If the starvation is simulated much longer (5-7 minutes in my case), the pod gets into the "Unknown" state as well; in this case, after the resources are cleaned up, the node can still go back to the normal state, but the pod gets terminated.

So I believe the first and second scenarios are what we are looking for: in these cases our pipeline can get back to a normal state after a short resource starvation.
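
A minimal sketch of how these states can be observed during such an experiment, assuming the official kubernetes Python client; the node/pod names and the polling interval are placeholders:

```python
# Sketch: poll the node's Ready condition and the pipeline pod's phase while
# a stress test runs, to see which of the three scenarios above occurs.
# Node/pod names are placeholders.
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def watch_starvation(node_name, pod_name, namespace="default", interval=15):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    while True:
        node = v1.read_node(node_name)
        ready = next(c for c in node.status.conditions if c.type == "Ready")
        try:
            pod_phase = v1.read_namespaced_pod(pod_name, namespace).status.phase
        except ApiException as e:
            # In scenario (3) the pod is eventually removed by the node controller
            pod_phase = "deleted" if e.status == 404 else "error"
        print("node Ready=%s, pod phase=%s" % (ready.status, pod_phase))
        time.sleep(interval)
```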

SilinPavel commented 4 years ago

In addition, I want to add some details related to this issue. It was found out that we can change a setting of the kube-controller-manager system pod to increase the period of time before the pipeline pod is terminated in case (3). According to the documentation (https://kubernetes.io/docs/concepts/architecture/nodes/#node-condition):

If the Status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout 
(an argument passed to the kube-controller-manager), all the Pods on the node are scheduled for 
deletion by the Node Controller. The default eviction timeout duration is five minutes. In some cases 
when the node is unreachable, the apiserver is unable to communicate with the kubelet on the node. 
The decision to delete the pods cannot be communicated to the kubelet until communication with the 
apiserver is re-established. In the meantime, the pods that are scheduled for deletion may continue 
to run on the partitioned node.

In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods 
from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until 
it is confirmed that they have stopped running in the cluster. You can see the pods that might be running 
on an unreachable node as being in the Terminating or Unknown state. In cases where Kubernetes cannot deduce 
from the underlying infrastructure if a node has permanently left a cluster, the cluster administrator may 
need to delete the node object by hand. Deleting the node object from Kubernetes causes all the Pod objects 
running on the node to be deleted from the apiserver, and frees up their names.

So the pod-eviction-timeout parameter (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/) can be changed to an appropriate value; this way we can postpone the termination of a pod by the node controller.
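
For reference, the currently effective value can be checked by inspecting the kube-controller-manager pod's command line - a sketch assuming the kubernetes Python client and a kubeadm-style deployment where the controller manager runs as a static pod in kube-system:

```python
# Sketch: check whether --pod-eviction-timeout is set explicitly on the
# kube-controller-manager. Assumes a kubeadm-style cluster where it runs
# as a static pod in the kube-system namespace.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
    "kube-system", label_selector="component=kube-controller-manager")
for pod in pods.items:
    for container in pod.spec.containers:
        args = (container.command or []) + (container.args or [])
        timeout = [a for a in args if a.startswith("--pod-eviction-timeout")]
        # If the flag is not passed explicitly, the default of 5m0s applies
        print(pod.metadata.name, timeout or "not set (default 5m0s)")
```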

Of course, there is another situation when the OOM killer can kill the docker container while a node is under pressure, but in this case we can't do anything.

In conclusion, here is a list of changes that we need to make to the kube configuration:

  1. Change pod-eviction-timeout to an appropriate value
  2. For each node that joins the cluster, we should add --eviction-hard= as a kubelet parameter in order to disable the eviction policy for that node (see the sketch below)

After these actions, we can introduce an approach to postpone the killing of lost nodes by the AutoscalerManager.
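
A sketch of how change (2) could be applied when a node joins the cluster, assuming a kubeadm-style setup where kubelet extra flags are read from /etc/sysconfig/kubelet; the exact path and bootstrap mechanics are assumptions, and the platform's own node init scripts may wire this differently:

```python
# Sketch: disable the kubelet eviction policy on a joining node by passing
# an empty --eviction-hard= via KUBELET_EXTRA_ARGS. Assumes a kubeadm-style
# setup where the kubelet reads /etc/sysconfig/kubelet; paths may differ.
import subprocess

KUBELET_ENV_FILE = "/etc/sysconfig/kubelet"  # assumed location


def disable_kubelet_eviction():
    # An empty --eviction-hard= clears all hard eviction thresholds, so the
    # kubelet will not evict pods when the node comes under resource pressure.
    # Note: this overwrites any existing KUBELET_EXTRA_ARGS in the file.
    with open(KUBELET_ENV_FILE, "w") as f:
        f.write('KUBELET_EXTRA_ARGS="--eviction-hard="\n')
    subprocess.check_call(["systemctl", "restart", "kubelet"])


if __name__ == "__main__":
    disable_kubelet_eviction()
```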