CentaurusInfra / arktos


Pod deletion was never seen by kubelet, leaving the pod in an infinite "Terminating" state #584

Open yb01 opened 4 years ago

yb01 commented 4 years ago

What happened: In the multi-API-server, multi-workload-controller perf run, quite a few pods were stuck in the "Terminating" state. The logs indicate the pods were recreated immediately after being deleted at the controller, and the kubelet never caught the deletion of the pod, so it ended up in this state:

  1. The pod stays in Terminating status with its containers still running.
  2. The kubelet never sees the delete or remove of the pod and keeps syncing it as a running pod.

What you expected to happen: The pod deletion should be seen by the kubelet, and the deletion process should complete on both the kubelet and the API server.

How to reproduce it (as minimally and precisely as possible): This occurs in the perf/scalability run; there is no simple repro at this moment.
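For reference, the affected pods can be spotted from the API server side. Below is a minimal sketch with plain client-go (the kubeconfig path is a placeholder, and the Arktos multi-tenant client may differ slightly) that lists pods which already carry a DeletionTimestamp but are still reported Running:

    // Sketch only: list pods that have been marked for deletion (DeletionTimestamp set)
    // but are still reported Running, i.e. the "stuck in Terminating" symptom above.
    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Placeholder kubeconfig path; adjust for the test cluster.
        config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Older client-go (as vendored around this fork's era) takes only ListOptions;
        // newer versions take a context.Context as the first argument.
        pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            if p.DeletionTimestamp != nil && p.Status.Phase == v1.PodRunning {
                fmt.Printf("%s/%s stuck Terminating since %v (grace %v)\n",
                    p.Namespace, p.Name, p.DeletionTimestamp, p.DeletionGracePeriodSeconds)
            }
        }
    }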

Anything else we need to know?: For the pod stuck in the Terminating state: saturation-deployment-0-5c568bc7fc-49jvp

Created at 01:00:57

I0811 01:00:57.634542 1 replica_set.go:594] Finished syncing ReplicaSet "f3ytox-testns/saturation-deployment-0-5c568bc7fc" (18.75µs)
I0811 01:00:57.637253 1 replica_set.go:294] Pod saturation-deployment-0-5c568bc7fc-49jvp created: &v1.Pod{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"saturation-deployment-0-5c568bc7fc-49jvp",

Deleted at 01:21:16

I0811 01:21:16.920297 1 deployment_controller.go:370] Pod saturation-deployment-0-5c568bc7fc-49jvp deleted.
I0811 01:21:16.920305 1 deployment_controller.go:438] Cannot get replicaset "saturation-deployment-0-5c568bc7fc" for pod "saturation-deployment-0-5c568bc7fc-49jvp": replicaset.apps "saturation-deployment-0-5c568bc7fc" not found

Re-create it at 01:21:19

I0811 01:21:19.868398 1 replica_set.go:594] Finished syncing ReplicaSet "f3ytox-testns/saturation-deployment-0-5c568bc7fc" (20.814µs)
I0811 01:21:19.868366 1 replica_set.go:294] Pod saturation-deployment-0-5c568bc7fc-49jvp created: &v1.Pod{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"saturation-deployment-0-5c568bc7fc-49jvp", GenerateName:"saturation-deployment-0-5c568bc7fc-", Tenant:"system", Namespace:"f3ytox-testns", SelfLink:"/api/v1/namespaces/f3ytox-testns/pods/saturation-deployment-0-5c568bc7fc-4

From the kubelet logs, it never got the "delete" or "remove" from its sync loop, which is confirmed by these two searches in the logs. Ying, I guess something might be wrong on the workload controller side.

ybai2016@shared-wc-api-multi2-minion-group-63fm /var/ybtest $ grep -i "syncloop (del" kubelet-hollow-node-6jc4c.lo* | grep -i 49jvp
ybai2016@shared-wc-api-multi2-minion-group-63fm /var/ybtest $ grep -i "syncloop (rem" kubelet-hollow-node-6jc4c.lo* | grep -i 49jvp
ybai2016@shared-wc-api-multi2-minion-group-63fm /var/ybtest $
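For context on what those empty greps mean: the kubelet only starts tearing a pod down when a DELETE or REMOVE update arrives on its pod-config channel and is dispatched by the sync loop. The sketch below is a self-contained illustration of that dispatch (the operation names mirror upstream kubetypes constants, but this is not the real kubelet code); the greps above show that for the stuck pod neither branch ever fired:

    // Illustration of the kubelet sync-loop dispatch for pod config updates.
    // Not the real kubelet code; it only shows why the "SyncLoop (DELETE ...)" /
    // "SyncLoop (REMOVE ...)" log lines are the signal that a deletion reached the node.
    package main

    import "fmt"

    type PodOperation int

    const (
        ADD    PodOperation = iota // new pod assigned to this node
        UPDATE                     // spec/metadata changed
        DELETE                     // DeletionTimestamp set on the API object (graceful delete started)
        REMOVE                     // pod object fully removed from the API server
    )

    type PodUpdate struct {
        Op     PodOperation
        Source string
        Pods   []string
    }

    func syncLoopIteration(u PodUpdate) {
        switch u.Op {
        case ADD:
            fmt.Printf("SyncLoop (ADD, %q): %v\n", u.Source, u.Pods)
        case UPDATE:
            fmt.Printf("SyncLoop (UPDATE, %q): %v\n", u.Source, u.Pods)
        case DELETE:
            // Graceful deletion: treated like an update that starts container teardown.
            fmt.Printf("SyncLoop (DELETE, %q): %v\n", u.Source, u.Pods)
        case REMOVE:
            // Object gone from the API server: forget the pod entirely.
            fmt.Printf("SyncLoop (REMOVE, %q): %v\n", u.Source, u.Pods)
        }
    }

    func main() {
        pod := "saturation-deployment-0-5c568bc7fc-49jvp"
        syncLoopIteration(PodUpdate{Op: ADD, Source: "api", Pods: []string{pod}})
        // For the stuck pods, updates like the two below never arrive:
        syncLoopIteration(PodUpdate{Op: DELETE, Source: "api", Pods: []string{pod}})
        syncLoopIteration(PodUpdate{Op: REMOVE, Source: "api", Pods: []string{pod}})
    }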

I randomly picked one pod that was actually deleted (i.e., not stuck in the Terminating state). The controller log for it makes sense to me. So I suspect that, for some reason, the controller recreated the pod but ended up only halfway done, leaving the pod "Terminating with a grace period" at the API server, while the kubelet picked up the new state and never got the delete update from the API server. Pod created at 01:07:21

I0811 01:07:21.687735 1 replica_set.go:294] Pod saturation-deployment-0-5c568bc7fc-nrxbq created: &v1.Pod{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"saturation-deployment-0-5c568bc7fc-nrxbq", GenerateName:"saturation-deployment-0-5c568bc7fc-", Tenant:"system", Namespace:"b4vt2l-testns", SelfLink:"/api/v1/namespaces/b4vt2l-testns/pods/saturation-deployment-0-5c568bc7fc-nrxbq", UID:"9ae632d2-9482-4bef-a07c-a91d63fc650c", HashKey:6712346518683512058, ResourceVersion:"837344580955275265", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63732704841, loc:(*

Pod deleted at 01:16:13, and it never shows up in the log after this point.

I0811 01:16:13.919783 1 deployment_controller.go:370] Pod saturation-deployment-0-5c568bc7fc-nrxbq deleted.
I0811 01:16:13.919814 1 deployment_controller.go:438] Cannot get replicaset "saturation-deployment-0-5c568bc7fc" for pod "saturation-deployment-0-5c568bc7fc-nrxbq": replicaset.apps "saturation-deployment-0-5c568bc7fc" not found

Environment:

Sindica commented 4 years ago

As discussed in the email thread, the above log is not evidence that the pod got recreated. We need to dig further to find the root cause.

Sindica commented 4 years ago

https://github.com/futurewei-cloud/arktos/pull/592

yb01 commented 4 years ago

The log does show two creations of the pod, one after it was deleted, even though this is just a controller log entry emitted on certain events. The log entries don't show which component made the creation, or when, and the API server logs were rotated out in this perf run.

Updated the title to reflect that the kubelet never sees the "deletion", if that helps.

yxiong2020 commented 4 years ago

Bumped up the log level; still investigating.

sonyafenge commented 4 years ago

9/3/2020 [issuedebug][arktos][nsdelete]

So far, high-QPS runs reproduce this issue 100% of the time. A new run with log level 4 has finished with this error:

E0904 01:11:18.686078   28364 simple_test_executor.go:382] Resource cleanup error: [Operation cannot be fulfilled on namespaces "v9linr-testns": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.]

Logs are located on sonyadev4 at /home/sonyali/logs/perf-test/gce-5000/arktos/0903-nsdelete-3a1w1e under GCP project workload-controller-manager.

yb01 commented 4 years ago

Findings in the logs (looking into the code for further action): the API server got the deletion from the kube-controller-manager (GC) and returned 200, which indicates the API part of the deletion was successfully initiated. So identifying why the API server failed to notify the kubelet is the key to this issue.
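For background on that path: the kubelet learns about pod deletions only through its API server watch, which is filtered to pods bound to its node (upstream builds this in pkg/kubelet/config/apiserver.go with a fieldSelector on spec.nodeName). Below is a rough, stand-alone sketch of that channel with plain client-go (node name and kubeconfig path are placeholders); if the MODIFIED event carrying the DeletionTimestamp, or the eventual DELETED event, never makes it through this watch, the kubelet has no way to know the pod was deleted, which matches the symptom here:

    // Rough, self-contained sketch of the only channel through which the kubelet
    // learns about pod deletions: a watch on pods bound to its node. Placeholder
    // values only; this is not the kubelet's actual source code.
    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Older client-go takes only ListOptions; newer versions also take a context.Context.
        w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(metav1.ListOptions{
            FieldSelector: "spec.nodeName=hollow-node-7swcm", // placeholder node name
        })
        if err != nil {
            panic(err)
        }
        for ev := range w.ResultChan() {
            // A graceful delete first arrives as MODIFIED (DeletionTimestamp set) and
            // later as DELETED once the object is removed from storage. If neither
            // reaches the kubelet, the pod is never torn down.
            fmt.Println(ev.Type)
        }
    }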

----- workload-controller-
workload-controller-manager.log-20200903-1599162601.gz:I0903 19:48:24.174110       1 controller_utils.go:617] Controller saturation-deployment-0-5c568bc7fc created pod saturation-deployment-0-5c568bc7fc-64p4v
workload-controller-manager.log-20200903-1599162601.gz:I0903 19:48:24.174542       1 event.go:259] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"v9linr-testns", Name:"saturation-deployment-0-5c568b
c7fc", UID:"b04ce4f9-81b3-4259-90da-a0925a54a7a7", APIVersion:"apps/v1", ResourceVersion:"838421704573714435", FieldPath:"", Tenant:"system"}): type: 'Normal' reason: 'SuccessfulCreate' Created pod: saturation-deployment-0-5c568bc7fc-64p4v

workload-controller-manager.log-20200903-1599162601.gz:I0903 19:48:24.276667       1 replica_set.go:294] Pod saturation-deployment-0-5c568bc7fc-64p4v created: &v1.Pod{TypeMeta:v1.TypeMeta{Kind:"", APIVersio
n:""}, ObjectMeta:v1.ObjectMeta{Name:"saturation-deployment-0-5c568bc7fc-64p4v", GenerateName:"saturation-deployment-0-5c568bc7fc-", Tenant:"system", Namespace:"v9linr-testns", SelfLink:"/api/v1/namespaces/
v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v", UID:"6e4ed0a7-9e06-463d-90f8-7a175375bd2b", HashKey:8168093572830555521, ResourceVersion:"838421710986280967", Generation:0, CreationTimestamp:v
1.Time{Time:time.Time{wall:0x0, ext:63734759304, loc:(*time.Location)(0x2ba7b80)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string{"group":"saturation"
, "name":"saturation-deployment-0", "pod-template-hash":"5c568bc7fc"}, Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"apps/v1", Kind:"ReplicaSet", Name
:"saturation-deployment-0-5c568bc7fc", UID:"b04ce4f9-81b3-4259-90da-a0925a54a7a7", HashKey:3581205081481740822, Controller:(*bool)(0xc021f6b187), BlockOwnerDeletion:(*bool)(0xc021f6b188)}}, Initializers:(*v
1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, Spec:v1.PodSpec{VPC:"", Nics:[]v1.Nic(nil), Volumes:[]v1.Volume{v1.Volume{Name:"default-token-rv7
rw", VolumeSource:v1.VolumeSource{HostPath:(*v1.HostPathVolumeSource)(nil), EmptyDir:(*v1.EmptyDirVolumeSource)(nil), GCEPersistentDisk:(*v1.GCEPersistentDiskVolumeSource)(nil), AWSElasticBlockStore:(*v1.AW
SElasticBlockStoreVolumeSource)(nil), GitRepo:(*v1.GitRepoVolumeSource)(nil), Secret:(*v1.SecretVolumeSource)(0xc015402f40), NFS:(*v1.NFSVolumeSource)(nil), ISCSI:(*v1.ISCSIVolumeSource)(nil), Glusterfs:(*v
1.GlusterfsVolumeSource)(nil), PersistentVolumeClaim:(*v1.PersistentVolumeClaimVolumeSource)(nil), RBD:(*v1.RBDVolumeSource)(nil), FlexVolume:(*v1.FlexVolumeSource)(nil), Cinder:(*v1.CinderVolumeSource)(nil
), CephFS:(*v1.CephFSVolumeSource)(nil), Flocker:(*v1.FlockerVolumeSource)(nil), DownwardAPI:(*v1.DownwardAPIVolumeSource)(nil), FC:(*v1.FCVolumeSource)(nil), AzureFile:(*v1.AzureFileVolumeSource)(nil), Con
figMap:(*v1.ConfigMapVolumeSource)(nil), VsphereVolume:(*v1.VsphereVirtualDiskVolumeSource)(nil), Quobyte:(*v1.QuobyteVolumeSource)(nil), AzureDisk:(*v1.AzureDiskVolumeSource)(nil), PhotonPersistentDisk:(*v
1.PhotonPersistentDiskVolumeSource)(nil), Projected:(*v1.ProjectedVolumeSource)(nil), PortworxVolume:(*v1.PortworxVolumeSource)(nil), ScaleIO:(*v1.ScaleIOVolumeSource)(nil), StorageOS:(*v1.StorageOSVolumeSo
urce)(nil), CSI:(*v1.CSIVolumeSource)(nil)}}}, InitContainers:[]v1.Container(nil), Containers:[]v1.Container{v1.Container{Name:"saturation-deployment-0", Image:"k8s.gcr.io/pause:3.1", Command:[]string(nil),
 Args:[]string(nil), WorkingDir:"", Ports:[]v1.ContainerPort(nil), EnvFrom:[]v1.EnvFromSource(nil), Env:[]v1.EnvVar(nil), Resources:v1.ResourceRequirements{Limits:v1.ResourceList(nil), Requests:v1.ResourceL
ist{"cpu":resource.Quantity{i:resource.int64Amount{value:1, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"1m", Format:"DecimalSI"}, "memory":resource.Quantity{i:resource.int64Amount{value:10, 
scale:6}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"10M", Format:"DecimalSI"}}}, ResourcesAllocated:v1.ResourceList(nil), ResizePolicy:[]v1.ResizePolicy(nil), VolumeMounts:[]v1.VolumeMount{v1.VolumeM
ount{Name:"default-token-rv7rw", ReadOnly:true, MountPath:"/var/run/secrets/kubernetes.io/serviceaccount", SubPath:"", MountPropagation:(*v1.MountPropagationMode)(nil), SubPathExpr:""}}, VolumeDevices:[]v1.
VolumeDevice(nil), LivenessProbe:(*v1.Probe)(nil), ReadinessProbe:(*v1.Probe)(nil), Lifecycle:(*v1.Lifecycle)(nil), TerminationMessagePath:"/dev/termination-log", TerminationMessagePolicy:"File", ImagePullP
olicy:"IfNotPresent", SecurityContext:(*v1.SecurityContext)(nil), Stdin:false, StdinOnce:false, TTY:false}}, VirtualMachine:(*v1.VirtualMachine)(nil), WorkloadInfo:[]v1.CommonInfo(nil), RestartPolicy:"Alway
s", TerminationGracePeriodSeconds:(*int64)(0xc021f6b5c0), ActiveDeadlineSeconds:(*int64)(nil), DNSPolicy:"ClusterFirst", NodeSelector:map[string]string(nil), ServiceAccountName:"default", DeprecatedServiceA
ccount:"default", AutomountServiceAccountToken:(*bool)(nil), NodeName:"", HostNetwork:false, HostPID:false, HostIPC:false, ShareProcessNamespace:(*bool)(nil), SecurityContext:(*v1.PodSecurityContext)(0xc029
754720), ImagePullSecrets:[]v1.LocalObjectReference(nil), Hostname:"", Subdomain:"", Affinity:(*v1.Affinity)(nil), SchedulerName:"default-scheduler", Tolerations:[]v1.Toleration{v1.Toleration{Key:"node.kube
rnetes.io/not-ready", Operator:"Exists", Value:"", Effect:"NoExecute", TolerationSeconds:(*int64)(0xc021f6b800)}, v1.Toleration{Key:"node.kubernetes.io/unreachable", Operator:"Exists", Value:"", Effect:"NoE
xecute", TolerationSeconds:(*int64)(0xc021f6b820)}}, HostAliases:[]v1.HostAlias(nil), PriorityClassName:"", Priority:(*int32)(0xc021f6b828), DNSConfig:(*v1.PodDNSConfig)(nil), ReadinessGates:[]v1.PodReadine
ssGate(nil), RuntimeClassName:(*string)(nil), EnableServiceLinks:(*bool)(0xc021f6b82c), PreemptionPolicy:(*v1.PreemptionPolicy)(nil)}, Status:v1.PodStatus{Phase:"Pending", Conditions:[]v1.PodCondition(nil),
 Message:"", Reason:"", NominatedNodeName:"", HostIP:"", PodIP:"", StartTime:(*v1.Time)(nil), InitContainerStatuses:[]v1.ContainerStatus(nil), ContainerStatuses:[]v1.ContainerStatus(nil), QOSClass:"Burstabl
e", VirtualMachineStatus:(*v1.VirtualMachineStatus)(nil), NICStatuses:[]v1.NICStatus(nil)}}. rv 838421710986280967

workload-controller-manager.log-20200903-1599163501.gz:I0903 20:01:41.558430       1 replica_set.go:357] Pod saturation-deployment-0-5c568bc7fc-64p4v updated, objectMeta {Name:saturation-deployment-0-5c568b
c7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0a7-9e06-46
3d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838421710986280967 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labe
ls:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UID:b04ce4f9
-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc021f6b187 BlockOwnerDeletion:0xc021f6b188}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]} -> {Name:saturation-deploymen
t-0-5c568bc7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0
a7-9e06-463d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838422129044094978 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:
<nil> Labels:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UI
D:b04ce4f9-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc0297aae57 BlockOwnerDeletion:0xc0297aae58}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]}.

workload-controller-manager.log-20200903-1599163501.gz:I0903 20:01:41.570361       1 replica_set.go:357] Pod saturation-deployment-0-5c568bc7fc-64p4v updated, objectMeta {Name:saturation-deployment-0-5c568b
c7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0a7-9e06-46
3d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838422129044094978 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labe
ls:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UID:b04ce4f9
-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc0297aae57 BlockOwnerDeletion:0xc0297aae58}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]} -> {Name:saturation-deploymen
t-0-5c568bc7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0
a7-9e06-463d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838422129051959300 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:
<nil> Labels:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UI
D:b04ce4f9-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc02419dca7 BlockOwnerDeletion:0xc02419dca8}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]}.

workload-controller-manager.log-20200903-1599163501.gz:I0903 20:01:42.826568       1 replica_set.go:357] Pod saturation-deployment-0-5c568bc7fc-64p4v updated, objectMeta {Name:saturation-deployment-0-5c568b
c7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0a7-9e06-46
3d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838422129051959300 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labe
ls:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UID:b04ce4f9
-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc02419dca7 BlockOwnerDeletion:0xc02419dca8}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]} -> {Name:saturation-deploymen
t-0-5c568bc7fc-64p4v GenerateName:saturation-deployment-0-5c568bc7fc- Tenant:system Namespace:v9linr-testns SelfLink:/api/v1/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v UID:6e4ed0
a7-9e06-463d-90f8-7a175375bd2b HashKey:8168093572830555521 ResourceVersion:838422129647550465 Generation:0 CreationTimestamp:2020-09-03 19:48:24 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:
<nil> Labels:map[group:saturation name:saturation-deployment-0 pod-template-hash:5c568bc7fc] Annotations:map[] OwnerReferences:[{APIVersion:apps/v1 Kind:ReplicaSet Name:saturation-deployment-0-5c568bc7fc UI
D:b04ce4f9-81b3-4259-90da-a0925a54a7a7 HashKey:3581205081481740822 Controller:0xc041516c97 BlockOwnerDeletion:0xc041516c98}] Initializers:nil Finalizers:[] ClusterName: ManagedFields:[]}.
ybai2016@sonyadev4:/home/sonyali/logs/perf-test/gce-5000/arktos/0903-nsdelete-3a1w1e/logs/nsdelete/kubemarkmaster$

------ Kube-controller-manager:
kube-controller-manager.log-20200903-1599166201.gz:I0903 20:47:20.645463       1 garbagecollector.go:406] processing item [v1/Pod, tenant system, namespace: v9linr-testns, name: saturation-deployment-0-5c56
8bc7fc-64p4v, uid: 6e4ed0a7-9e06-463d-90f8-7a175375bd2b]
kube-controller-manager.log-20200903-1599166201.gz:I0903 20:47:21.021365       1 garbagecollector.go:519] delete object [v1/Pod, tenant system, namespace: v9linr-testns, name: saturation-deployment-0-5c568b
c7fc-64p4v, uid: 6e4ed0a7-9e06-463d-90f8-7a175375bd2b] with propagation policy Background

------ api-server:
kube-apiserver.log-20200903-1599166501.gz:I0903 20:47:21.773745       1 trace.go:81] Trace[849523224]: "Delete /api/v1/tenants/system/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v" 
(started: 2020-09-03 20:47:21.030250095 +0000 UTC m=+5488.885508434) (total time: 743.451026ms):
kube-apiserver.log-20200903-1599166501.gz:I0903 20:47:21.773985       1 wrap.go:47] DELETE /api/v1/tenants/system/namespaces/v9linr-testns/pods/saturation-deployment-0-5c568bc7fc-64p4v: (743.915949ms) 200 [
kube-controller-manager/v2.7.4 (linux/amd64) kubernetes/cbd6bf2/system:serviceaccount:kube-system:generic-garbage-collector [::1]:40006]

------ etcd:
ybai2016@sonyadev4:/home/sonyali/logs/perf-test/gce-5000/arktos/0903-nsdelete-3a1w1e/logs/nsdelete/kubemarkmaster$ zgrep -i saturation-deployment-0-5c568bc7fc-64p4v etcd*.*
etcd.log-20200904-1599177601.gz:2020-09-03 20:47:21.005196 W | etcdserver: read-only range request "key:\"/registry/pods/v9linr-testns/saturation-deployment-0-5c568bc7fc-64p4v\" " with result "range_response_count:1 size:1340" took too long (332.036287ms) to execute
etcd.log-20200904-1599177601.gz:2020-09-03 20:47:21.317562 W | etcdserver: read-only range request "key:\"/registry/pods/v9linr-testns/saturation-deployment-0-5c568bc7fc-64p4v\" " with result "range_response_count:1 size:1340" took too long (265.500101ms) to execute
etcd.log-20200904-1599177601.gz:2020-09-03 20:47:21.545431 W | etcdserver: read-only range request "key:\"/registry/pods/v9linr-testns/saturation-deployment-0-5c568bc7fc-64p4v\" " with result "range_response_count:1 size:1340" took too long (217.001308ms) to execute
ybai2016@sonyadev4:/home/sonyali/logs/perf-test/gce-5000/arktos/0903-nsdelete-3a1w1e/logs/nsdelete/kubemarkmaster$

------ Kubelet:
I0903 20:01:41.489574       6 config.go:420] Receiving a new pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"
I0903 20:01:41.490050       6 kubelet.go:1995] SyncLoop (ADD, "api"): "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"

I0903 20:01:42.629211       6 kubelet_pods.go:1365] Got phase: pod saturation-deployment-0-5c568bc7fc-64p4v, phase Running

**** Update pod status to Running phase
I0903 20:01:42.637529       6 status_manager.go:504] Patch status for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b):ResourceVersion=838422129051959
300" with "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"ContainersReady\"},{\"type\":\"Ready\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\
"2020-09-03T20:01:42Z\",\"message\":null,\"reason\":null,\"status\":\"True\",\"type\":\"ContainersReady\"},{\"lastTransitionTime\":\"2020-09-03T20:01:42Z\",\"message\":null,\"reason\":null,\"status\":\"True
\",\"type\":\"Ready\"}],\"containerStatuses\":[{\"containerID\":\"unknown://f6b374e2005f98df\",\"image\":\"k8s.gcr.io/pause:3.1\",\"imageID\":\"docker://k8s.gcr.io/pause:3.1\",\"lastState\":{},\"name\":\"sa
turation-deployment-0\",\"ready\":true,\"resources\":{},\"restartCount\":0,\"state\":{\"running\":{\"startedAt\":\"2020-09-03T20:01:42Z\"}}}],\"phase\":\"Running\",\"podIP\":\"10.234.31.159\"}}"
I0903 20:01:42.637590       6 status_manager.go:511] Status for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b):ResourceVersion=838422129647550465" u
pdated successfully: (2, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:41 +0000 UTC Reason: Message:} {Type:Contain
ersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:42 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTr
ansitionTime:2020-09-03 20:01:42 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:41 +0000 UTC Reason: Message:}] Me
ssage: Reason: NominatedNodeName: HostIP:10.64.36.95 PodIP:10.234.31.159 StartTime:2020-09-03 20:01:41 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:saturation-deployment-0 State:{Waiting:nil
Running:&ContainerStateRunning{StartedAt:2020-09-03 20:01:42 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:k8s.gcr.io/pause:3.1 Im
ageID:docker://k8s.gcr.io/pause:3.1 ContainerID:unknown://f6b374e2005f98df Resources:{Limits:map[] Requests:map[]}}] QOSClass:Burstable VirtualMachineStatus:nil NICStatuses:[]})

**** Kubelet never sees the DELETE of the pod and is stuck in this loop:
kubelet-hollow-node-7swcm.log:I0904 01:04:13.641779 6 kubelet.go:2062] SyncLoop (SYNC): 1 pods; saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b):ResourceVersion=838422129044094978
kubelet-hollow-node-7swcm.log:I0904 01:04:13.641849 6 kubelet_pods.go:1334] Generating status for "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"
kubelet-hollow-node-7swcm.log:I0904 01:04:13.641897 6 kubelet_pods.go:1365] Got phase: pod saturation-deployment-0-5c568bc7fc-64p4v, phase Running
kubelet-hollow-node-7swcm.log:I0904 01:04:13.642141 6 status_manager.go:368] Ignoring same status for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:41 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:42 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:42 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-03 20:01:41 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.64.36.95 PodIP:10.234.31.159 StartTime:2020-09-03 20:01:41 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:saturation-deployment-0 State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-09-03 20:01:42 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:k8s.gcr.io/pause:3.1 ImageID:docker://k8s.gcr.io/pause:3.1 ContainerID:unknown://f6b374e2005f98df Resources:{Limits:map[] Requests:map[]}}] QOSClass:Burstable VirtualMachineStatus:nil NICStatuses:[]}
kubelet-hollow-node-7swcm.log:I0904 01:04:13.642311 6 volume_manager.go:350] Waiting for volumes to attach and mount for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"
kubelet-hollow-node-7swcm.log:I0904 01:04:13.642331 6 volume_manager.go:383] All volumes are attached and mounted for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"
kubelet-hollow-node-7swcm.log:I0904 01:04:13.642428 6 kuberuntime_manager.go:917] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c4fa7f4c46d93ac1 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[] ContainersToUpdate:map[] ContainersToRestart:[] Hotplugs:{NICsToAttach:[] NICsToDetach:[]}} for pod "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)"
kubelet-hollow-node-7swcm.log:I0904 01:04:13.642557 6 event.go:259] Event(v1.ObjectReference{Kind:"Pod", Namespace:"v9linr-testns", Name:"saturation-deployment-0-5c568bc7fc-64p4v", UID:"6e4ed0a7-9e06-463d-90f8-7a175375bd2b", APIVersion:"v1", ResourceVersion:"838422129044094978", FieldPath:"", Tenant:"system"}): type: 'Warning' reason: 'MissingClusterDNS' pod: "saturation-deployment-0-5c568bc7fc-64p4v_v9linr-testns_system(6e4ed0a7-9e06-463d-90f8-7a175375bd2b)". kubelet doesnot have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
kubelet-hollow-node-7swcm.log:I0904 01:04:13.670571 6 reconciler.go:235] Starting operationExecutor.MountVolume for volume "default-token-rv7rw" (UniqueName: "kubernetes.io/secret/6e4ed0a7-9e06-463d-90f8-7a175375bd2b-default-token-rv7rw") pod "saturation-deployment-0-5c568bc7fc-64p4v" (UID: "6e4ed0a7-9e06-463d-90f8-7a175375bd2b") ResourceVersion: 838422129044094978 Volume is already mounted to pod, but remount was requested.
yb01 commented 4 years ago

A local repro without running perf:

  1. two API servers, one workload controller with qps=2000
  2. two tenants, each with one namespace and two deployments with replicas=100 each
  3. --max-pods=200 on the node

Create the deployments and, after all pods are running, delete the deployments.

yb01 commented 4 years ago

From last Friday's mini stress run with log level 6:

  1. The pods stuck in Terminating never reach the actual deletion in etcd, i.e., the API server log never shows "going to delete".
  2. A large number of etcd errors.

Segment of code in registry/generic/registry/store.go:


    // delete immediately, or no graceful deletion supported
    klog.V(6).Infof("going to delete %s from registry: ", name)
    out = e.NewFunc()
    if err := e.Storage.Delete(ctx, key, out, &preconditions, storage.ValidateObjectFunc(deleteValidation), dryrun.IsDryRun(options.DryRun)); err != nil {
        // Please refer to the place where we set ignoreNotFound for the reason
        // why we ignore the NotFound error .
        if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {
            // The lastExisting object may not be the last state of the object
            // before its deletion, but it's the best approximation.
            out, err := e.finalizeDelete(ctx, lastExisting, true)
            return out, true, err
        }
        return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)
    }
    out, err = e.finalizeDelete(ctx, out, true)
    return out, true, err
}
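For completeness: in the upstream Delete() that this snippet sits in, graceful deletion is handled earlier, before this "delete immediately" path is reached. When a grace period applies, the store only stamps the object with a DeletionTimestamp/DeletionGracePeriodSeconds and returns; the "going to delete ... from registry" line is only logged in the second phase, when the object is actually removed from etcd (normally after the kubelet finishes teardown and confirms with grace period 0). The self-contained sketch below (illustrative names, not the real registry code) shows that two-phase flow; the pods stuck in Terminating with no "going to delete" log never made it past the first phase:

    // Illustration of the two-phase graceful delete behind the store.go snippet above.
    // Illustrative names only, not the real registry code.
    package main

    import (
        "fmt"
        "time"
    )

    type object struct {
        name               string
        deletionTimestamp  *time.Time
        gracePeriodSeconds *int64
    }

    // requestDelete models phase 1: the DELETE request only marks the object for
    // graceful deletion; nothing is removed from storage yet.
    func requestDelete(o *object, grace int64) {
        now := time.Now()
        o.deletionTimestamp = &now
        o.gracePeriodSeconds = &grace
        fmt.Printf("marked %s for graceful deletion (grace=%ds)\n", o.name, grace)
    }

    // finalizeDelete models phase 2: once teardown is confirmed (grace period 0),
    // the registry logs "going to delete" and removes the key from storage.
    func finalizeDelete(o *object) {
        fmt.Printf("going to delete %s from registry\n", o.name)
    }

    func main() {
        pod := &object{name: "saturation-deployment-0-5c568bc7fc-64p4v"}
        requestDelete(pod, 30)
        // In the stuck case the kubelet never sees the deletion, never finishes
        // teardown, and phase 2 is never triggered:
        _ = finalizeDelete // never called, so "going to delete" never appears in the log
    }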