Pods stay in graceful termination on executor restart

sttts commented 9 years ago

It was encountered during below scenario:

pod scheduled on to a slave and running.
mesos-slave (mesos-0.23 with slave recovery) process on that slave was restarted. killing mesos-slave killed the minion/executor/kubelet also. docker container(s) still running.
mesos-slave on start tries to recover the framework and executor, but fails to do so and issues kill on the docker container(s) belonging to the pod. slave sends task_lost status
scheduler sees task_lost status and reconciles the pod (deletes it). the default gracefulTerminationPeriod is 30secs.

/xref https://github.com/mesosphere/kubernetes-mesos/issues/495

jdef commented 9 years ago

User that reported this was using systemd. There's a known issue with mesos+systemd and slave recovery. see https://issues.apache.org/jira/browse/MESOS-3352

On Thu, Oct 1, 2015 at 2:55 AM, Dr. Stefan Schimanski < notifications@github.com> wrote:

It was encountered during below scenario:

pod scheduled on to a slave and running.

mesos-slave (mesos-0.23 with slave recovery) process on that slave was restarted. killing mesos-slave killed the minion/executor/kubelet also. docker container(s) still running.

mesos-slave on start tries to recover the framework and executor, but fails to do so and issues kill on the docker container(s) belonging to the pod. slave sends task_lost status

scheduler sees task_lost status and reconciles the pod (deletes it). the default gracefulTerminationPeriod is 30secs.

/xref #495 https://github.com/mesosphere/kubernetes-mesos/issues/495

— Reply to this email directly or view it on GitHub https://github.com/mesosphere/kubernetes-mesos/issues/523.

ravilr commented 9 years ago

when i ran through the above scenario with systemd KillMode=process, stopping mesos-slave only stops the mesos-slave. the minion and kubelet/proxy still running along with docker containers. when the mesos slave came backup (within less than a minute), it sent a reconnect request to executor and reregistered it. All the mesos executor minion container and kubelet and proxy were intact and so were the docker containers.

mesos slave log: I1002 01:57:59.619097 1253 slave.cpp:3967] Sending reconnect request to executor b00397ce399d489_k8sm-executor of framework 20150929-020711-628185610-5050-27631-0000 at executor(1)@10.90.115.246:55163 I1002 01:57:59.623270 1255 slave.cpp:2483] Re-registering executor b00397ce399d489_k8sm-executor of framework 20150929-020711-628185610-5050-27631-0000 I1002 01:57:59.626236 1254 cpushare.cpp:409] Updated 'cpu.shares' to 358 (cpus 0.35) for container 9bb4e6a0-b14f-49c3-b9f5-51d7f7bc5ae5 I1002 01:57:59.626283 1255 mem.cpp:358] Updated 'memory.soft_limit_in_bytes' to 562MB for container 9bb4e6a0-b14f-49c3-b9f5-51d7f7bc5ae5 I1002 01:57:59.629832 1255 mem.cpp:393] Updated 'memory.limit_in_bytes' to 562MB for container 9bb4e6a0-b14f-49c3-b9f5-51d7f7bc5ae5 I1002 01:58:01.621327 1252 slave.cpp:2620] Cleaning up un-reregistered executors I1002 01:58:01.622798 1252 slave.cpp:4026] Finished recovery

But saw a failure message in executor.log during executor driver reregistration:

W1001 08:55:01.120988 28437 handler.go:538] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/system.slice/mesos-slave.service/test": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent: no such file or directory I1001 08:55:01.218259 28437 executor.go:275] Executor driver reconnect I1001 08:55:01.218361 28437 executor.go:285] Received reconnect request from slave &SlaveID{Value:_20150929-020711-628185610-5050-27631-S1,XXX_unrecognized:[],} I1001 08:55:01.231128 28437 executor.go:239] Executor driver reregistered I1001 08:55:01.231173 28437 executor.go:250] Re-registered on slave &SlaveID{Value:_20150929-020711-628185610-5050-27631-S1,XXXunrecognized:[],} I1001 08:55:01.231270 28437 executor.go:237] Reregistered with slave &SlaveInfo{Hostname:...com,Resources:[&Resource{Name:_cpus,Type:_SCALAR,Scalar:&Value_Scalar{Value:_2,XXX_unrecognized:[],},Ranges:nil,Set:nil,Role:*,Disk:nil,Reservation:nil,Revocable:nil,XXX_unrecognized:[],} &Resource{Name:_mem,Type:_SCALAR,Scalar:&Value_Scalar{Value:2928,XXX_unrecognized:[],},Ranges:nil,Set:nil,Role:**,Disk:nil,Reservation:nil,Revocable:nil,XXX_unrecognized:[],} &Resource{Name:_disk,Type:_SCALAR,Scalar:&Value_Scalar{Value:_95608,XXX_unrecognized:[],},Ranges:nil,Set:nil,Role:__,Disk:nil,Reservation:nil,Revocable:nil,XXX_unrecognized:[],} &Resource{Name:_ports,Type:_RANGES,Scalar:nil,Ranges:&Value_Ranges{Range:[&Value_Range{Begin:_31000,End:_32000,XXX_unrecognized:[],}],XXX_unrecognized:[],},Set:nil,Role:__,Disk:nil,Reservation:nil,Revocable:nil,XXX_unrecognized:[],}],Attributes:[],Id:&SlaveID{Value:_20150929-020711-628185610-5050-27631-S1,XXX_unrecognized:[],},Checkpoint:_true,Port:_5051,XXX_unrecognized:[],} E1001 08:55:01.231667 28437 executor.go:239] failed to reregister/transition to a connected state

Though everything seemed to continue working normally despite above executor error message.
@jdef is the executor error message expected?

ravilr commented 9 years ago

Another scenario which could leave Terminted pods in k8sm setup:

kube-ui rc with replica=1 binded and running on the single mesos slave cluster. docker daemon was restarted. All the containers were lost and kubelet executor reports task lost .
scheduler gets the task lost status, doesn't do anything
scheduler log: I1002 02:02:54.441693 24690 scheduler.go:364] task status update "TASK_LOST" from "SOURCE_EXECUTOR" for task "pod.58c54f65-68a8-11e5-b4e7-fa163e4a8c06" on slave "20150929-020711-628185610-5050-2 7631-S1" executor "" for reason "none" with message "containers-disappeared" I1002 02:02:54.441332 24690 scheduler.go:699] Sending ACK for status update {FrameworkId:&FrameworkID{Value:_20150929-020711-628185610-5050-27631-0000,XXX_unrecognized:[],} ExecutorId:&ExecutorID{Value:_b00397ce399d489_k8sm-executor,XXX_unrecognized:[],} SlaveId:&SlaveID{Value:_20150929-020711-628185610-5050-27631-S1,XXX_unrecognized:[],} Status:&TaskStatus{TaskId:&TaskID{Value:_pod.58c54f65-68a8-11e5-b4e7-fa163e4a8c06,XXX_unrecognized:[],},State:_TASK_LOST,Data:nil,Message:_containers-disappeared,SlaveId:&SlaveID{Value:_20150929-020711-628185610-5050-27631-S1,XXX_unrecognized:[],},Timestamp:_1.443751374e+09,ExecutorId:nil,Healthy:nil,Source:_SOURCEEXECUTOR,Reason:nil,Uuid:[177 147 173 9 104 169 17 229 149 247 250 22 62 54 71 119],XXX_unrecognized:[],} Timestamp:0xc208bd7e20 Uuid:[177 147 173 9 104 169 17 229 149 247 250 22 62 54 71 119] LatestState:TASK_LOST XXX_unrecognized:[]} to "master@1x.x.x.x:5050"
kubelet executor enters suicide watch (configured with suicide timeout of 2 mins) and exits itself.
controller manager daemon's node controller evicts the pod as the kubelet is not reporting node status. deletes the pod with graceTerminationperiod of 30 secs.
replication controller immediately sees the required replicas less than 1 and creates a new pod
node controller cancels the pod eviction as the new pod got binded to the same slave with new executor.

controller-manager.log (all hosts and ips refers to the same single mesos slave in the cluster):

I1002 02:05:49.676273 20326 nodecontroller.go:313] Evicting pods on node abc.com: 2015-10-02 02:05:49.676223065 +0000 UTC is later than 2015-10-02 02:05:49.676220686 +0000 UTC + 1m0s I1002 02:05:49.677626 20326 mesos.go:174] resolved host 'abc.com' to 'x.x.x.x' I1002 02:05:49.736193 20326 nodecontroller.go:398] Recording Deleting all Pods from Node ...com. event message for node abc.com I1002 02:05:49.736275 20326 nodecontroller.go:604] Delete pod kube-ui-v1-kl1ep I1002 02:05:49.736339 20326 event.go:206] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"...", UID:"....", APIVersion:"", Resourc eVersion:"", FieldPath:""}): reason: 'DeletingAllPods' Node ....com event: Deleting all Pods from Node ....com. I1002 02:05:49.736381 20326 event.go:206] Event(api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"kube-ui-v1-kl1ep", UID:"589d1cc4-68a8-11e5-a382-fa163e4a8c06", APIVersion:"v1", Res ourceVersion:"129848", FieldPath:""}): reason: 'NodeControllerEviction' Deleting Pod kube-ui-v1-kl1ep from Node ...com I1002 02:05:49.761628 20326 replication_controller.go:354] Too few "kube-system"/"kube-ui-v1" replicas, need 1, creating 1 I1002 02:05:49.766979 20326 nodecontroller.go:658] Pod kube-ui-v1-kl1ep still terminating, requested grace period 30s, 29.99596015s remaining I1002 02:05:49.766995 20326 nodecontroller.go:203] Pods terminating since 2015-10-02 02:05:49.762930102 +0000 UTC on "...com", estimated completion 29.99596015s I1002 02:05:49.775889 20326 event.go:206] Event(api.ObjectReference{Kind:"ReplicationController", Namespace:"kube-system", Name:"kube-ui-v1", UID:"580f655e-68a8-11e5-a382-fa163e4a8c06", APIVersion:"v1", ResourceVersion:"129824", FieldPath:""}): reason: 'SuccessfulCreate' Created pod: kube-ui-v1-tqy9h I1002 02:05:54.683301 20326 nodecontroller.go:521] node ...com hasn't been updated for 1m5.1634102s. Last ready condition is: {Type:Ready Status:Unknown LastHeartbeatTime:2015-10-02 02:04:44 +0000 UTC LastTransitionTime:2015-10-02 02:05:29 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.} I1002 02:05:54.701356 20326 nodecontroller.go:313] Evicting pods on node ....com: 2015-10-02 02:05:54.701323723 +0000 UTC is later than 2015-10-02 02:05:54.701320144 +0000 UTC + 1m0s I1002 02:05:54.702410 20326 mesos.go:174] resolved host '....com' to 'x.x.x.x' I1002 02:05:56.490336 20326 nodecontroller.go:398] Recording Deleting all Pods from Node ...com. event message for node ...com I1002 02:05:56.490472 20326 event.go:206] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:".com", UID:".com", APIVersion:"", ResourceVersion:"", FieldPath:""}): reason: 'DeletingAllPods' Node ...com event: Deleting all Pods from Node ...com. I1002 02:05:59.706108 20326 nodecontroller.go:575] Cancelling pod Eviction on Node: ....com I1002 02:05:59.706124 20326 nodecontroller.go:318] Node ......com is ready again, cancelled pod eviction

ravilr commented 9 years ago

Another ugly situation we have run into where a pod belonging to a replication controller failed to be rescheduled, during a rolling cluster upgrade maintenance :

the mesos slave executor where the pod was running undergoes some upgrade which resulted in containers getting killed and kubelet executor shuts itself down. Around the same time, the scheduler schedules a completely different pod to the same mesos slave. the new pod comes up with the new kubelet executor. the previously running pod before the upgrade never gets evicted by the controller-manager node-controller, because the kubelet from the same slave host starts reporting node status immediately. As such, the api still seems to have the lost pod and replicaiton controller assumes there are enough pod as per spec replicas.

ravilr commented 9 years ago

@jdef @sttts i think there are two issues from my understanding:

pods remaining in graceful termination when k8sm scheduler deletes the pod during terminal task reconcile or when pod recovery fails during non-terminal task reconcile.
when there is terminal task status update (TASK_lost) from SOURCE_executor, the reconcileTerminalTask() doesn't delete the api.pod.

For (1), do you seen any issues in adding grace_period=0 as the deleteOption to those delete calls?

for (2), especially in case of docker daemon restart, i think in upstream, kubelet relaunces the pod eventually when it can communicate back with the docker daemon within a configurable timeout interval. In k8sm setup, do you see any issues with reconcileTerminalTask() deleting the api.pod? pods launched through replicationController would then be recovered by the controller.

sttts commented 9 years ago

@ravilr thanks for the great analysis.

About (1): sounds reasonable. About (2): if the task is lost the corresponding pod should be deleted IMO I think. We have this huge condition now:

if (state == podtask.StateRunning || state == podtask.StatePending) &&
        ((taskStatus.GetSource() == mesos.TaskStatus_SOURCE_MASTER && taskStatus.GetReason() == mesos.TaskStatus_REASON_RECONCILIATION) ||
            (taskStatus.GetSource() == mesos.TaskStatus_SOURCE_SLAVE && taskStatus.GetReason() == mesos.TaskStatus_REASON_EXECUTOR_TERMINATED) ||
            (taskStatus.GetSource() == mesos.TaskStatus_SOURCE_SLAVE && taskStatus.GetReason() == mesos.TaskStatus_REASON_EXECUTOR_UNREGISTERED)) {

Do you suggest another case we should consider for deleting?

If you have some code already, a PR is highly welcome.

ravilr commented 9 years ago

have below change in my repo:

handles task lost status update from executor on docker daemon restart, by deleting the pod so that the RC creates a new one (if the pod is backed by RC).
Also while deleting pod on terminal task status update, deletes with grace_period=0 as the corresponding pod is already gone.

xref https://github.com/kubernetes/kubernetes/pull/15518

sttts commented 9 years ago

The PR is merged. Closing.

d2iq-archive / kubernetes-mesos

Pods stay in graceful termination on executor restart #523