[Open] AntoineDao opened this issue 1 month ago
Why isn't it possible for you to detect this case when you are collecting the data?
I'm uncomfortable with working around this bug in Argo Workflows. Have you managed to find or raise an issue in containerd?
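Concretely, this is the kind of check I have in mind at data-collection time (a minimal sketch of the idea; the helper name and threshold are assumptions, not an Argo API):

```go
package main

import (
	"fmt"
	"time"
)

// suspiciousDuration reports whether a container-level duration looks
// inflated by the epoch-startedAt bug. Purely illustrative; not part of
// any Argo or containerd API.
func suspiciousDuration(startedAt time.Time, reported time.Duration) bool {
	// A startedAt at (or near) the Unix epoch means containerd never
	// recorded a real start time; any duration derived from it is bogus.
	if startedAt.IsZero() || startedAt.Unix() == 0 {
		return true
	}
	// No container can have run longer than the time elapsed since it started.
	return reported > time.Since(startedAt)
}

func main() {
	epoch := time.Unix(0, 0).UTC()
	fmt.Println(suspiciousDuration(epoch, time.Hour))                        // true: epoch start
	fmt.Println(suspiciousDuration(time.Now().Add(-2*time.Hour), time.Hour)) // false: plausible
}
```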
> We have traced this back to an issue with containerd which will sometimes fail a Pod and, for some reason, set the `startedAt` date to "the epoch" (i.e. `1970-01-01T00:00:00Z`). I appreciate this feels like more of a containerd bug than an argo-workflows bug
Well, that is a containerd bug. I think it makes sense for Argo to assume the data is accurate. If you assume unreliable dependencies, then most other assumptions go out the window too; it's not really possible to handle all such cases.
> however, I think it is worth adding some logic to handle this case more gracefully.
>
> I will propose a fix very soon and will let the maintainers decide whether it is worth patching.
Possibly, though that is a workaround rather than a fix. I don't have a strong opinion on it personally, although I'd lean toward relying on upstream containerd to fix this.
> From our perspective it's worth it because we use the CPU/memory usage reported by Argo Workflows to bill downstream customers... 😨
Huh, I don't think I've seen that pattern before. At my last job, we used Kubecost for that purpose; it can have bugs too, but it is at least purpose-built for cost tracking, unlike Argo's calculations.
Pre-requisites

- I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
When running Argo Workflows we occasionally see workflows report surprisingly high CPU and memory resource durations.

We have traced this back to an issue with containerd which will sometimes fail a Pod and, for some reason, set the `startedAt` date to "the epoch" (i.e. `1970-01-01T00:00:00Z`). This results in an incorrect duration calculation in the function below, which assumes the `startedAt` timestamp is always valid: with a `startedAt` of 1970 and a `finishedAt` of today, the computed duration is roughly 54 years, which then inflates the reported resource durations accordingly.
https://github.com/argoproj/argo-workflows/blob/25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee/util/resource/summary.go#L16-L22
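One possible shape for the guard (a minimal sketch of the idea, not the actual patch; the helper name and signature are assumptions, not the code in `summary.go`):

```go
package main

import (
	"fmt"
	"time"
)

// containerDuration is a hypothetical helper showing how the duration
// calculation could tolerate the containerd bug.
func containerDuration(startedAt, finishedAt time.Time) time.Duration {
	// A zero or epoch startedAt means containerd never recorded a real
	// start time; treating it as valid yields a ~54-year duration.
	if startedAt.IsZero() || startedAt.Unix() == 0 {
		return 0
	}
	return finishedAt.Sub(startedAt)
}

func main() {
	finished := time.Date(2024, 10, 1, 12, 0, 0, 0, time.UTC)

	// Buggy case: an epoch startedAt is discarded instead of producing
	// decades of CPU/memory "usage".
	fmt.Println(containerDuration(time.Unix(0, 0).UTC(), finished)) // 0s

	// Normal case: a real startedAt gives the expected duration.
	fmt.Println(containerDuration(finished.Add(-90*time.Second), finished)) // 1m30s
}
```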
I appreciate this feels like more of a containerd bug than an argo-workflows bug; however, I think it is worth adding some logic to handle this case more gracefully.
For additional context, we are running on top of managed GKE when we see this bug. I am not attaching a reproducible workflow because... well... this bug is challenging to reproduce!
I will propose a fix very soon and will let the maintainers decide whether it is worth patching. From our perspective it's worth it because we use the CPU/memory usage reported by Argo Workflows to bill downstream customers... 😨
Version(s)
v3.5.11
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container