argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Incorrect resources duration when `containerd` sets incorrect `startedAt` #13709

Open AntoineDao opened 2 weeks ago

AntoineDao commented 2 weeks ago

Pre-requisites

What happened? What did you expect to happen?

When running Argo Workflows we occasionally see that our workflows report surprisingly high CPU and Memory resource durations.

(screenshot: workflow reporting unexpectedly high CPU and memory resource durations)

We have traced this back to an issue with containerd, which will sometimes fail a Pod and, for some reason, set the startedAt date to the epoch (i.e. 1970-01-01T00:00:00Z).

...
"initContainerStatuses": [
{
  "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
  "image": "quay.io/argoproject/argoexec:v3.5.11",
  "imageID": "quay.io/argoproject/argoexec@sha256:bb3938480cbe7a7c0f053eb77a5d3edb22c868bf57407bf066fd105961f26c72",
  "lastState": {},
  "name": "init",
  "ready": false,
  "restartCount": 0,
  "started": false,
  "state": {
    "terminated": {
      "containerID": "containerd://b55ff9eb8687235ae0fecd4216522b75b6f7a0fab5eca18de8c06365bd2636fc",
      "exitCode": 128,
      "finishedAt": "2024-09-13T13:28:03Z",
      "message": "failed to create containerd task: failed to create shim task: context deadline exceeded: unknown",
      "reason": "StartError",
      "startedAt": "1970-01-01T00:00:00Z" <---- here
    }
  }
}
],
...

This results in an incorrect duration calculation in the function below, which assumes the reported startedAt is valid: with a startedAt of 1970 and a finishedAt in 2024, the computed duration spans roughly 54 years, which then inflates the reported resource durations.

https://github.com/argoproj/argo-workflows/blob/25bbb71cced32b671f9ad35f0ffd1f0ddb8226ee/util/resource/summary.go#L16-L22
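For illustration, here is a minimal, self-contained calculation (not the actual summary.go code) showing how an epoch startedAt blows up a naive finishedAt - startedAt duration; the timestamps are taken from the status snippet above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Values from the initContainerStatuses snippet above.
	startedAt, _ := time.Parse(time.RFC3339, "1970-01-01T00:00:00Z")
	finishedAt, _ := time.Parse(time.RFC3339, "2024-09-13T13:28:03Z")

	// A naive finishedAt - startedAt duration spans ~54 years, which then
	// inflates any resource duration derived from it.
	years := finishedAt.Sub(startedAt).Hours() / 24 / 365
	fmt.Printf("%.1f years\n", years) // ≈ 54.7
}
```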

I appreciate this feels like more of a containerd bug than an argo-workflows bug; however, I think it is worth adding some logic to handle this case more gracefully.
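To that end, here is a minimal sketch of the kind of guard I have in mind (a hypothetical helper, not the existing summary.go code): treat a terminated container whose startedAt is the Unix epoch (or zero) as having no usable start time and fall back to finishedAt, so the computed duration collapses to zero instead of decades.

```go
package resource

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// effectiveStartedAt is a hypothetical guard: container runtimes can report
// StartError terminations with startedAt left at the Unix epoch, so fall back
// to finishedAt in that case, yielding a zero duration for the container.
func effectiveStartedAt(t *corev1.ContainerStateTerminated) time.Time {
	started := t.StartedAt.Time
	if started.IsZero() || started.Unix() == 0 {
		return t.FinishedAt.Time
	}
	return started
}
```

Whether the fallback should be finishedAt, the Pod's startTime, or skipping the container entirely is up to the maintainers; the point is only to reject an obviously impossible startedAt.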

For additional context, we see this bug when running on managed GKE. I am not attaching a reproducible workflow because... well... this bug is challenging to reproduce!

I will propose a fix very soon and will let the maintainers decide whether it is worth patching. From our perspective it's worth it because we use the CPU/memory usage reported by Argo Workflows to bill downstream customers... 😨

Version(s)

v3.5.11

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A
Joibel commented 1 week ago

Why isn't it possible for you to detect this case when you are collecting the data?

I'm uncomfortable with working around this bug in workflows. Have you managed to find or raise an issue in containerd?

agilgur5 commented 1 week ago

We have traced this back to an issue with containerd, which will sometimes fail a Pod and, for some reason, set the startedAt date to the epoch (i.e. 1970-01-01T00:00:00Z).

I appreciate this feels like more of a containerd bug than an argo-workflows bug

Well, that is a containerd bug. I think it makes sense for Argo to assume the data is accurate. If you assume unreliable dependencies, then most other assumptions go out the window too; it's not really possible to handle all such cases.

however I think it is worth adding some logic to handle this case more gracefully.

I will propose a fix very soon and will let the maintainers decide whether it is worth patching.

Possibly, though that is a workaround rather than a fix. I don't have a strong opinion on it personally although I'd lean toward relying on upstream containerd to fix this.

From our perspective it's worth it because we use the Argo Workflows reported CPU/Memory usage to bill downstream customers... 😨

Huh, I don't think I've seen that pattern before. At my last job, we used KubeCost for that purpose, which also can have bugs, but is at least purpose-built for cost tracking, unlike Argo's calculations.