Can you please confirm this was working in v3.0?
@amitm02 can you share the controller logs?
time="2021-05-31T14:54:09.809Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:09.813Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:14.816Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:14.820Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:14.874Z" level=info msg="Watch clusterworkflowtemplates 200"
time="2021-05-31T14:54:16.785Z" level=info msg="cleaning up pod" action=deletePod key=argo/do-pipline-6srw4-1016923602/deletePod
time="2021-05-31T14:54:16.788Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:19.823Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:19.827Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:22.085Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/sequence-cspz4/labelPodCompleted
time="2021-05-31T14:54:22.089Z" level=info msg="Patch pods 404"
time="2021-05-31T14:54:22.090Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="pods \"sequence-cspz4\" not found" key=argo/sequence-cspz4/labelPodCompleted
time="2021-05-31T14:54:22.107Z" level=warning msg="Non-transient error: pods \"sequence-cspz4\" not found"
time="2021-05-31T14:54:22.185Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/sequence-lv7cx/labelPodCompleted
time="2021-05-31T14:54:22.189Z" level=info msg="Patch pods 404"
time="2021-05-31T14:54:22.189Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="pods \"sequence-lv7cx\" not found" key=argo/sequence-lv7cx/labelPodCompleted
time="2021-05-31T14:54:22.189Z" level=warning msg="Non-transient error: pods \"sequence-lv7cx\" not found"
time="2021-05-31T14:54:22.285Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/sequence-pvldp/labelPodCompleted
time="2021-05-31T14:54:22.288Z" level=info msg="Patch pods 404"
time="2021-05-31T14:54:22.289Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="pods \"sequence-pvldp\" not found" key=argo/sequence-pvldp/labelPodCompleted
time="2021-05-31T14:54:22.289Z" level=warning msg="Non-transient error: pods \"sequence-pvldp\" not found"
time="2021-05-31T14:54:22.385Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/sequence-gkjmm-3818374930/labelPodCompleted
time="2021-05-31T14:54:22.388Z" level=info msg="Patch pods 404"
time="2021-05-31T14:54:22.389Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="pods \"sequence-gkjmm-3818374930\" not found" key=argo/sequence-gkjmm-3818374930/labelPodCompleted
time="2021-05-31T14:54:22.389Z" level=warning msg="Non-transient error: pods \"sequence-gkjmm-3818374930\" not found"
time="2021-05-31T14:54:22.485Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-6pjk8-2430312336/deletePod
time="2021-05-31T14:54:22.489Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:22.585Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-6pjk8-2616966664/deletePod
time="2021-05-31T14:54:22.589Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:22.785Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-x2pkj-543642184/deletePod
time="2021-05-31T14:54:22.788Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:22.885Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-x2pkj-2272827344/deletePod
time="2021-05-31T14:54:22.888Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:22.985Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-kldp5-808541759/deletePod
time="2021-05-31T14:54:22.988Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:23.085Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-kldp5-3397304079/deletePod
time="2021-05-31T14:54:23.088Z" level=info msg="Delete pods 404"
time="2021-05-31T14:54:24.830Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:24.834Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:29.837Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:29.841Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:34.845Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:34.848Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:39.852Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:39.856Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:44.860Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:44.863Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:49.867Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:49.871Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:54.874Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:54.878Z" level=info msg="Update leases 200"
time="2021-05-31T14:54:59.881Z" level=info msg="Get leases 200"
time="2021-05-31T14:54:59.885Z" level=info msg="Update leases 200"
time="2021-05-31T14:55:04.888Z" level=info msg="Get leases 200"
@alexec, I cannot confirm this was working in v3.0. This bug comes and goes; e.g. I had trouble reproducing it this morning.
Maybe the issue is related to:
time="2021-05-31T14:54:22.389Z" level=warning msg="failed to clean-up pod" action=labelPodCompleted error="pods \"sequence-gkjmm-3818374930\" not found" key=argo/sequence-gkjmm-3818374930/labelPodCompleted
time="2021-05-31T14:54:22.389Z" level=warning msg="Non-transient error: pods \"sequence-gkjmm-3818374930\" not found"
Not sure if it's related, but I also get this kind of error in the controller:
time="2021-06-01T08:19:37.608Z" level=warning msg="failed to clean-up pod" action=terminateContainers error="Internal error occurred: error executing command in container: failed to exec in container: failed to create exec \"3772223fc747a580005ba18f08b3ec142f057678d9e8ab221c215972c353bb1b\": cannot exec in a stopped state: unknown" key=argo/do-pipline-b6mz7-2774831253/terminateContainers
time="2021-06-01T08:19:37.609Z" level=warning msg="Non-transient error: Internal error occurred: error executing command in container: failed to exec in container: failed to create exec \"3772223fc747a580005ba18f08b3ec142f057678d9e8ab221c215972c353bb1b\": cannot exec in a stopped state: unknown"
time="2021-05-31T14:54:22.985Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-kldp5-808541759/deletePod time="2021-05-31T14:54:22.988Z" level=info msg="Delete pods 404" time="2021-05-31T14:54:23.085Z" level=info msg="cleaning up pod" action=deletePod key=argo/sequence-kldp5-3397304079/deletePod time="2021-05-31T14:54:23.088Z" level=info msg="Delete pods 404"
The controller is trying to delete pods that are not found; it looks like they have already been deleted.
I tried the above example locally and was not able to reproduce it.
Can you provide more information, e.g. k8s API server logs, and enable debug-level logging on the controller?
Could it be deleting from the wrong namespace, or with the wrong name? Inspect the code?
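For reference, debug logging on the controller can be turned on via the --loglevel flag (the same flag mentioned later in this thread). A minimal sketch, assuming a default install where the controller runs as the workflow-controller Deployment in the argo namespace:

```yaml
# Fragment of the workflow-controller Deployment args (names assume a default "argo" install;
# adjust to your environment). Keep any other existing args as-is.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          args:
            - --loglevel
            - debug
```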
@sarabala1979, like all annoying bugs, this one is not easily reproducible. It comes and goes. We will try to hunt down the next occurrence and bring more logs.
Same issue, but without the "failed to clean up pod" errors, or any other logs indicating an attempt to delete the pods (even when using --loglevel debug).
Containers inside the pods are terminated with "reason: Completed", but the pods themselves remain in the Completed state for a long time afterwards.
What we were testing with was a very simple 50-pod sequence, where each pod executes an echo command (times 10-100 workflows per run).
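For context, a hypothetical sketch of the kind of test workflow described above; the image, names, and podGC strategy are assumptions, not the exact manifest that was run:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sequence-
spec:
  entrypoint: main
  podGC:
    strategy: OnPodCompletion   # assumed strategy; the report does not say which one was used
  templates:
    - name: main
      steps:
        - - name: echo
            template: echo
            withSequence:
              count: "50"       # 50 pods per workflow, as described above
    - name: echo
      container:
        image: alpine:3.14      # assumed image
        command: [sh, -c, "echo hello"]
```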
Version used: 3.1.0-rc12
Additional logs:
@DO-YardenG Can you add the pod YAML (kubectl get pod <> -o yaml)? Is it possible to get the k8s API server log to see the controller making the delete call?
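A sketch of the commands being asked for above; the pod name is a placeholder and the namespace is assumed to be argo, as in the logs earlier in this thread:

```shell
# Dump the pod spec/status and any events recorded for it
kubectl -n argo get pod <pod-name> -o yaml
kubectl -n argo get events --field-selector involvedObject.name=<pod-name>
```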
Sure; it's not for the same pod, but one with the same behavior (the pods do get deleted at some point, just with a varying delay).
As you can see, the pod was eventually deleted, but over 15 min after it was considered "completed" (the timing varies, and some pods delete instantly as they complete)
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
I'd like to reopen this issue please. We are seeing the same behaviour with OnPodSuccess. We are on 3.0.7. Previously, on 3.0.2, I think it was working, but I can no longer confirm as we've upgraded.
Details/symptoms:
- podGC: strategy: OnPodSuccess is defined in the Workflow Controller Configmap (see the configmap sketch below).
- From a ClusterWorkflowTemplate (PodGCStrategy via the manifest), the PodGCStrategy applies as per the controller configmap.
- "cleaning up pod" followed by "Delete pods 404" is seen repeating in the Controller logs, referring to pods from workflows that were manually deleted.
- Pods are being added to the pod_cleanup_queue (as per prometheus metrics), but at a much lower rate than the pod_queue, and (many) minutes after the workflows have been completed. The cluster is not under any load at this time.
Let me know if there's anything else I can do to help troubleshoot.
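A minimal sketch of how that strategy is typically set as a workflow default in the controller configmap; the configmap and namespace names assume a default install, so adjust to your environment:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      podGC:
        strategy: OnPodSuccess
```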
They do get deleted eventually, but the timing varies. Unconfirmed, but they seem to be getting deleted more reliably/consistently when the cluster is under heavy load (~15K running pods)
We have the same issue. Completed pods accumulate under heavy load. We are running a few workflows, each with many pods (10k), completing relatively quickly (1-10 seconds).
I've tried tuning:
- qps
- pod_cleanup_workers
Neither solves the issue (see the controller args sketch below).
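A sketch of where those two knobs live, as workflow-controller container args; the flag names are as of v3.1, the values are purely illustrative, and both should be checked against workflow-controller --help for your version:

```yaml
# Fragment of the workflow-controller Deployment; keep any other existing args as-is.
containers:
  - name: workflow-controller
    args:
      - --qps=30                  # client-side rate limit towards the k8s API
      - --pod-cleanup-workers=16  # number of workers draining the pod cleanup queue
```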
Summary
While running the workflow below, the completed pods sequence-kldp5-3397304079 and sequence-kldp5-808541759 are not removed by the OnPodCompletion strategy.
Running on v3.1-rc10, Emissary, GKE.
wf.yaml.txt
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.