argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Workflow Failed: wait container is forcefully stopped #9328

Closed: dacamposol closed this issue 2 years ago

dacamposol commented 2 years ago


Summary

We updated Argo Workflows to version 3.3.8 and lately we have noticed that some of the Workflows are failing because their Pods get terminated.

When I analysed one of the Pods, I noticed the following container status:

- containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
  image: argoproj/workflow-controller:v3.3.8-linux-amd64
  imageID: argoproj/workflow-controller:v3.3.8-linux-amd64@sha256:1ecc5b305fb784271797c14436c30b4b6d23204d603d416fdb0382f62604325f
  lastState: {}
  name: wait
  ready: false
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
      exitCode: 2
      finishedAt: "2022-08-10T07:06:08Z"
      reason: Error
      startedAt: "2022-08-10T07:06:07Z"
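
For reference, a status block like the one above can be pulled straight from the Pod with something like this (the Pod name is a placeholder):

❯ kubectl get pod -n argo <workflow-pod-name> \
    -o jsonpath='{.status.containerStatuses[?(@.name=="wait")]}'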

Also, when I check the logs of the aforementioned wait container, I don't see anything other than the following two lines:

fatal: bad g in signal handler
fatal: bad g in signal handler

It's not a consistent error; it happens in roughly 1 out of 12 runs, but it's strange and I don't see a way to fix it.

What version are you running?

I'm running Argo Workflows 3.3.8, deployed by Helm in our cluster.
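
For context, the deployed release and the controller image can be checked with something like this (the namespace and Deployment name match our install; yours may differ):

❯ helm list -n argo
❯ kubectl -n argo get deploy argo-workflow-controller \
    -o jsonpath='{.spec.template.spec.containers[0].image}'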

Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

Any Workflow can reproduce this bug on our cluster at some point, even the one from the examples.

# Logs from the workflow controller:
time="2022-08-10T08:51:07.741Z" level=info msg="Queueing Failed workflow metadata/metadata-process-6phwc for delete in 2m57s due to TTL"

# Logs from the workflow's wait container:
fatal: bad g in signal handler
fatal: bad g in signal handler

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 commented 2 years ago

@dacamposol can you provide the minimum reproducible workflow and controller logs for this workflow?

dacamposol commented 2 years ago

Hello @sarabala1979, thanks for your response.

I'm using a simple example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
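
To hit the intermittent failure, a loop like this can be used to submit it repeatedly (a sketch; it assumes the argo CLI is installed and the manifest above is saved as hello-world.yaml):

❯ for i in $(seq 1 20); do argo submit -n argo hello-world.yaml; done
❯ argo list -n argo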

I've submitted it around 20 times and two of the runs errored out. These are the workflow-controller logs of one of them:

❯ kubectl logs -n argo argo-workflow-controller-57dd4b56fb-tw7vg | grep hello-world-jw7z7
time="2022-08-11T15:16:35.981Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:35.993Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:35.994Z" level=info msg="Pod node hello-world-jw7z7 initialized Pending" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg="Created pod: hello-world-jw7z7 (hello-world-jw7z7)" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.045Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=47110009 workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.030Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.030Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.031Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=hello-world-jw7z7 old.message= old.phase=Pending old.progress=0/1
time="2022-08-11T15:16:46.031Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.031Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.045Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=47110191 workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="node unchanged" nodeID=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="node changed" new.message="Error (exit code 2)" new.phase=Error new.progress=0/1 nodeID=hello-world-jw7z7 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-08-11T15:17:09.120Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Updated message  -> Error (exit code 2)" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Checking daemoned children of " namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.126Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-world-jw7z7-1340600742-agent/deletePod
time="2022-08-11T15:17:09.144Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=47110635 workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.161Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello-world-jw7z7/labelPodCompleted

The events in the Workflow (K9s output):

Events:
  Type     Reason               Age    From                 Message
  ----     ------               ----   ----                 -------
  Normal   WorkflowRunning      6m31s  workflow-controller  Workflow Running
  Warning  WorkflowFailed       5m57s  workflow-controller  Error (exit code 2)
  Normal   WorkflowNodeRunning  5m57s  workflow-controller  Running node hello-world-jw7z7: Error (exit code 2)
  Warning  WorkflowNodeError    5m57s  workflow-controller  Error node hello-world-jw7z7: Error (exit code 2)

Logs of the wait container in the Pod generated by the workflow:

❯ kubectl logs -n argo hello-world-jw7z7 -c wait
fatal: bad g in signal handler
fatal: bad g in signal handler

sarabala1979 commented 2 years ago

@dacamposol thanks for your response. Are you using the emissary executor or a different one?

dacamposol commented 2 years ago

@dacamposol thanks for your response. Are you using the emissary executor or a different one?

Thank you for checking the issue so fast.

I'm using emissary.

alexec commented 2 years ago

Can you see if this happens on :latest?

juliev0 commented 2 years ago

Hi @dacamposol, in the Pod status that you show above, are you sure that's the Workflow Pod and not the Workflow Controller Pod? Your image is argoproj/workflow-controller:v3.3.8-linux-amd64, but that image should only be used for the Controller, not for a Workflow Pod.

dacamposol commented 2 years ago

Hi @dacamposol, in the Pod status that you show above, are you sure that's the Workflow Pod and not the Workflow Controller Pod? Your image is argoproj/workflow-controller:v3.3.8-linux-amd64, but that image should only be used for the Controller, not for a Workflow Pod.

Do you mean in the initial message? Yes, I took it from the status of the Pod spawned by the corresponding Workflow.

Can you see if this happens on :latest?

Do you mean version 3.4.0?

We have some breaking changes there, so I cannot safely test it right now. Give me a day and I'll take a look.

alexec commented 2 years ago

Do you mean version 3.4.0?

No, I mean the :latest image tag.

juliev0 commented 2 years ago

I tend to wonder if things were configured strangely. You shouldn't see a Workflow Pod using the workflow-controller image. You should have three containers: "main" using the image in your spec, "wait" using the quay.io/argoproj/argoexec image, and an init container "init" also using quay.io/argoproj/argoexec.
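
A quick way to confirm which images a Workflow Pod actually runs is something like this (the Pod name is a placeholder):

❯ kubectl get pod -n argo <workflow-pod-name> \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.image}{"\n"}{end}{range .status.containerStatuses[*]}{.name}{"\t"}{.image}{"\n"}{end}'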

dacamposol commented 2 years ago

I tend to wonder if things were configured strangely. You shouldn't see a Workflow Pod using the workflow-controller image. You should have three containers: "main" using the image in your spec, "wait" using the quay.io/argoproj/argoexec image, and an init container "init" also using quay.io/argoproj/argoexec.

That was my mistake when writing the OP. Since our images pass through a proxy registry so that they can be scanned and used within our intranet, I replaced its name with the public one and it seems I replaced it incorrectly. The Workflow Pod is indeed using the executor image in wait and init.

The real status block from the initial message (with the GCR project concealed) is the following:

- containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
  image: eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor:3.3.8
  imageID: eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor@sha256:1ecc5b305fb784271797c14436c30b4b6d23204d603d416fdb0382f62604325f
  lastState: {}
  name: wait
  ready: false
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
      exitCode: 2
      finishedAt: "2022-08-10T07:06:08Z"
      reason: Error
      startedAt: "2022-08-10T07:06:07Z"

Sorry for the inconvenience 😅

PS: Again, this only happens occasionally; the majority of the time the Workflows run without any issue, but roughly one out of twelve times there is that weird fatal: bad g in signal handler error. Since we submit thousands of Workflows per day, it was more noticeable in our case.


No, I mean the :latest image tag.

It doesn't do anything for any Workflow. It doesn't show any event on the Workflow resource and the logs of the workflow-controller Pod show the following error:

time="2022-09-09T15:45:06.300Z" level=info msg=healthz age=5m0s err="workflow never reconciled: workflow-template-hello-world-2gqkh" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=

I just changed the image of the related containers to :latest and also removed the containerRuntimeExecutor key from the config, as it seems it wasn't recognised in that version (3.3.9, I assume?).

dacamposol commented 2 years ago

@alexec just for including more info:

The moment I downgraded back to the previous image (:v3.3.8-linux-amd64), all of the stagnant Workflows (20) started. The result after all of them ran: 17 succeeded and 3 failed with the initially mentioned error.

With the :latest-linux-amd64 tag, only the behaviour from the previous comment could be seen.

juliev0 commented 2 years ago

I see you're using your own copy of our Docker image. I wonder if there is any chance that it somehow got corrupted. Could you try removing "--executor-image" from your workflow-controller configuration and use the standard image?
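
If it helps, that flag (when set) appears in the controller Deployment's args and can be inspected or removed with something like this (the Deployment name is inferred from the controller Pod name in this thread):

❯ kubectl -n argo get deploy argo-workflow-controller \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
❯ kubectl -n argo edit deploy argo-workflow-controller   # drop the --executor-image argument if present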

juliev0 commented 2 years ago

Out of curiosity, which version of Argo Workflows were you using before, and were you using the emissary executor or a different one?

Based on somebody else who got the same "bad g in signal handler" error here (https://github.com/loft-sh/devspace/issues/1334; not Argo Workflows), I am wondering if it's something particular to your environment. You're running in Google Cloud, right? Using Linux amd64 machines? Anything else to clarify?

dacamposol commented 2 years ago

I see you're using your own copy of our Docker image. I wonder if there is any chance that it somehow got corrupted. Could you try removing "--executor-image" from your workflow-controller configuration and use the standard image?

I don't know which version :latest points to, but for the :v3.3.8 tag I compared the image digests and they are identical. Any corruption would have been reflected there.
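
For reference, this is one way to compare the digests without pulling the images (assuming Docker with buildx; the quay.io image here stands in for whichever upstream the proxy actually mirrors):

❯ docker buildx imagetools inspect quay.io/argoproj/argoexec:v3.3.8 | grep Digest
❯ docker buildx imagetools inspect eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor:3.3.8 | grep Digest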

Based on somebody else who got the same "bad g in signal handler" error here (https://github.com/loft-sh/devspace/issues/1334; not Argo Workflows), I am wondering if it's something particular to your environment. You're running in Google Cloud, right? Using Linux amd64 machines? Anything else to clarify?

The bad g in signal handler error actually comes from Go's runtime signal handling (signal_unix), so there are probably people hitting this issue with a variety of different tools.

Concretely, the tests mentioned in this thread run in a GCP cluster in the europe-west3 region. The cluster's Kubernetes version is 1.23.8 and the nodes are Linux amd64 machines.
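
Those details can be read straight off the nodes, e.g.:

❯ kubectl get nodes -o custom-columns=NAME:.metadata.name,ARCH:.status.nodeInfo.architecture,OS:.status.nodeInfo.osImage,KUBELET:.status.nodeInfo.kubeletVersion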

Out of curiosity, which version of Argo Workflows were you using before, and were you using the emissary executor or a different one?

We were using an old version, v2.11.7, and the container executor.

juliev0 commented 2 years ago

Thanks for the information.

Can you provide the main container log together with the wait container and Controller logs (full logs)? Also, would you mind providing your workflow-controller-config?

juliev0 commented 2 years ago

@alexec just for including more info:

The moment I downgraded back to the previous image (:v3.3.8-linux-amd64), all of the stagnant Workflows (20) started. The result after all of them ran: 17 succeeded and 3 failed with the initially mentioned error.

With the :latest-linux-amd64 tag, only the behaviour from the previous comment could be seen.

Sorry, can you clarify what you mean by "only the behaviour from the previous comment could be seen"? Which previous comment? (Trying to figure out if something has been resolved between 3.3.8 and latest.)

dacamposol commented 2 years ago

Can you provide the main container log together with the wait container and Controller logs (full logs)? Also, would you mind providing your workflow-controller-config?

Sure. They fail in the init container, so there is not much in the logs:

fatal: bad g in signal handler
fatal: bad g in signal handler
Stream closed EOF for argo/hello-world-8gbgf-560498845 (init)

The main and wait containers are stuck in PodInitializing.
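
For completeness, the init logs and the init container state above can be fetched with commands like these (the Pod name is taken from the log line above):

❯ kubectl logs -n argo hello-world-8gbgf-560498845 -c init
❯ kubectl get pod -n argo hello-world-8gbgf-560498845 \
    -o jsonpath='{.status.initContainerStatuses[0].state}'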

In the Controller, it only shows this after a few minutes:

time="2022-09-14T06:55:19.001Z" level=info msg="Deleting garbage collected workflow 'argo/hello-world-8gbgf
time="2022-09-14T06:55:19.028Z" level=info msg="Successfully deleted 'argo/hello-world-8gbgf'"

Also, here is the ConfigMap of the controller, as requested:

Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=dxp-argo-987lx-dxp-dev
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-1662030514699
Annotations:  <none>

Data
====
config:
----
containerRuntimeExecutor: emissary
artifactRepository:
  gcs:
    bucket: dxp-argo-artifacts
    keyFormat: '{{workflow.namespace}}/{{workflow.name}}/'
    serviceAccountKeySecret:
      key: serviceAccountKey
      name: argo-artifacts-gcs

Sorry, can you clarify what you mean by "only the behaviour from the previous comment could be seen"? Which previous comment? (Trying to figure out if something has been resolved between 3.3.8 and latest.)

I mean that it doesn't do anything; the Workflows just don't run.

I refer to this comment:

It doesn't do anything for any Workflow. It doesn't show any event on the Workflow resource and the logs of the workflow-controller Pod show the following error...

alexec commented 2 years ago

There is no point testing on v3.3. There is no fix on that branch, nor is there likely to be. You need to test with the :latest tags.

https://www.howtogeek.com/devops/understanding-dockers-latest-tag/
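
One way to switch a test install to :latest, sketched under the assumption that the upstream images are used directly and the Deployment's container is named workflow-controller:

❯ kubectl -n argo set image deploy/argo-workflow-controller \
    workflow-controller=quay.io/argoproj/workflow-controller:latest
# and point --executor-image (if set) at quay.io/argoproj/argoexec:latest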

juliev0 commented 2 years ago

@alexec I think he did test with the :latest-linux-amd64 tag and, as far as I understand, the behavior is the same.

Please correct me if I'm wrong.

@dacamposol Thanks for providing the additional information.

dacamposol commented 2 years ago

There is no point testing on v3.3. There is no fix on that branch, nor is there likely to be. You need to test with the :latest tags.

I tested it with :latest-linux-amd64 and the Workflows didn't start at all.

I assume it's due to some breaking change between 3.3.x and latest that affects our particular Workflows.

@juliev0 you're welcome :)

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

isubasinghe commented 2 years ago

I ran 1000 workflows and still was not able to reproduce this issue, unfortunately.

@FabianKramm I did some investigation on my end, so downgrading Docker for Mac to v3.1.0 works fine. It seems the issue is related to qemu upgrades in the latest Docker for Mac v3.2.0.

Perhaps this is related to the environment you run argo-workflows on rather than to argo-workflows itself?

dacamposol commented 2 years ago

@isubasinghe it was a problem with the 3.3.x version.

I don't know what changed, but once I upgraded to 3.4.x, the error stopped happening.

BTW, I'm closing this, as the upgrade to the latest version solved the weird issue.