argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Incorrect resources duration when `containerd` sets incorrect `startedAt` #13709

Open AntoineDao opened 2 weeks ago

AntoineDao commented 2 weeks ago

Pre-requisites

What happened? What did you expect to happen?

When running Argo Workflows we occasionally see that our workflows report surprisingly high CPU and Memory resource durations.

(screenshot: workflow reporting unexpectedly high CPU and memory resource durations)

We have traced this back to an issue with containerd, which will sometimes fail a Pod and, for some reason, set the startedAt date to the epoch (i.e. 1970-01-01T00:00:00Z).

...
"initContainerStatuses": [
{
  "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
  "image": "quay.io/argoproject/argoexec:v3.5.11",
  "imageID": "quay.io/argoproject/argoexec@sha256:bb3938480cbe7a7c0f053eb77a5d3edb22c868bf57407bf066fd105961f26c72",
  "lastState": {},
  "name": "init",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "terminated": {
      "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
      "exitCode": 128,
      "finishedAt": "2024-09-13T13:28:03Z",
      "message": "failed to create containerd task: failed to create shim task: context deadline exceeded: unknown",
      "reason": "StartError",
      "startedAt": "1970-01-01T00:00:00Z" <---- here
    }
  }
}
],
...

This results in an incorrect duration calculation in the function below, which assumes the reported startedAt is valid: with a startedAt of 1970 and a finishedAt in 2024, the computed duration spans roughly 54 years, which then inflates the reported resource durations.

https://github.com/argoproj/argo-workflows/blob/25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee/util/resource/summary.go#L16-L22
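For illustration, here is a minimal, self-contained calculation (not the actual summary.go code) showing how an epoch startedAt blows up a naive finishedAt - startedAt duration; the timestamps are taken from the status snippet above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Values from the initContainerStatuses snippet above.
	startedAt, _ := time.Parse(time.RFC3339, "1970-01-01T00:00:00Z")
	finishedAt, _ := time.Parse(time.RFC3339, "2024-09-13T13:28:03Z")

	// A naive finishedAt - startedAt duration spans ~54 years, which then
	// inflates any resource duration derived from it.
	years := finishedAt.Sub(startedAt).Hours() / 24 / 365
	fmt.Printf("%.1f years\n", years) // ≈ 54.7
}
```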

I appreciate this feels like more of a containerd bug than an argo-workflows bug; however, I think it is worth adding some logic to handle this case more gracefully.
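To that end, here is a minimal sketch of the kind of guard I have in mind (a hypothetical helper, not the existing summary.go code): treat a terminated container whose startedAt is the Unix epoch (or zero) as having no usable start time and fall back to finishedAt, so the computed duration collapses to zero instead of decades.

```go
package resource

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// effectiveStartedAt is a hypothetical guard: container runtimes can report
// StartError terminations with startedAt left at the Unix epoch, so fall back
// to finishedAt in that case, yielding a zero duration for the container.
func effectiveStartedAt(t *corev1.ContainerStateTerminated) time.Time {
	started := t.StartedAt.Time
	if started.IsZero() || started.Unix() == 0 {
		return t.FinishedAt.Time
	}
	return started
}
```

Whether the fallback should be finishedAt, the Pod's startTime, or skipping the container entirely is up to the maintainers; the point is only to reject an obviously impossible startedAt.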

For additional context, we see this bug when running on managed GKE. I am not attaching a reproducible workflow because... well... this bug is challenging to reproduce!

I will propose a fix very soon and will let the maintainers decide whether it is worth patching. From our perspective it's worth it because we use the CPU/memory usage reported by Argo Workflows to bill downstream customers... 😨

Version(s)

v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A
Joibel commented 1 week ago

Why isn't it possible for you to detect this case when you are collecting the data?

I'm uncomfortable with working around this bug in workflows. Have you managed to find or raise an issue in containerd?

agilgur5 commented 1 week ago

We have traced this back to an issue with containerd, which will sometimes fail a Pod and, for some reason, set the startedAt date to the epoch (i.e. 1970-01-01T00:00:00Z).

I appreciate this feels like more of a containerd bug than an argo-workflows bug

Well, that is a containerd bug. I think it makes sense for Argo to assume the data is accurate. If you assume unreliable dependencies, then most other assumptions go out the window too; it's not really possible to handle all such cases.

however I think it is worth adding some logic to handle this case more gracefully.

I will propose a fix very soon and will let the maintainers decide whether it is worth patching.

Possibly, though that is a workaround rather than a fix. I don't have a strong opinion on it personally although I'd lean toward relying on upstream containerd to fix this.

From our perspective it's worth it because we use the Argo Workflows reported CPU/Memory usage to bill downstream customers... 😨

Huh, I don't think I've seen that pattern before. At my last job, we used KubeCost for that purpose, which also can have bugs, but is at least purpose-built for cost tracking, unlike Argo's calculations.