Closed: dacamposol closed this issue 2 years ago.
@dacamposol can you provide the minimum reproducible workflow and controller logs for this workflow?
Hello @sarabala1979, thanks for your response.
I'm using a simple example workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
  labels:
    workflows.argoproj.io/archive-strategy: "false"
  annotations:
    workflows.argoproj.io/description: |
      This is a simple hello world example.
      You can also run it in Python: https://couler-proj.github.io/couler/examples/#hello-world
spec:
  entrypoint: whalesay
  templates:
    - name: whalesay
      container:
        image: docker/whalesay:latest
        command: [cowsay]
        args: ["hello world"]
I've submitted it around 20 times and two of them had an error. These are the workflow-controller logs for one of the failed runs:
❯ kubectl logs -n argo argo-workflow-controller-57dd4b56fb-tw7vg | grep hello-world-jw7z7
time="2022-08-11T15:16:35.981Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:35.993Z" level=info msg="Updated phase -> Running" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:35.994Z" level=info msg="Pod node hello-world-jw7z7 initialized Pending" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg="Created pod: hello-world-jw7z7 (hello-world-jw7z7)" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.029Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:36.045Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=47110009 workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.030Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.030Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.031Z" level=info msg="node changed" new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=hello-world-jw7z7 old.message= old.phase=Pending old.progress=0/1
time="2022-08-11T15:16:46.031Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.031Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:46.045Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=47110191 workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="node unchanged" nodeID=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:16:56.047Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Processing workflow" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="node changed" new.message="Error (exit code 2)" new.phase=Error new.progress=0/1 nodeID=hello-world-jw7z7 old.message=PodInitializing old.phase=Pending old.progress=0/1
time="2022-08-11T15:17:09.120Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Updated phase Running -> Error" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Updated message -> Error (exit code 2)" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Marking workflow completed" namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.120Z" level=info msg="Checking daemoned children of " namespace=argo workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.126Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-world-jw7z7-1340600742-agent/deletePod
time="2022-08-11T15:17:09.144Z" level=info msg="Workflow update successful" namespace=argo phase=Error resourceVersion=47110635 workflow=hello-world-jw7z7
time="2022-08-11T15:17:09.161Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello-world-jw7z7/labelPodCompleted
The events in the Workflow (K9s output):
Events:
  Type     Reason               Age    From                 Message
  ----     ------               ----   ----                 -------
  Normal   WorkflowRunning      6m31s  workflow-controller  Workflow Running
  Warning  WorkflowFailed       5m57s  workflow-controller  Error (exit code 2)
  Normal   WorkflowNodeRunning  5m57s  workflow-controller  Running node hello-world-jw7z7: Error (exit code 2)
  Warning  WorkflowNodeError    5m57s  workflow-controller  Error node hello-world-jw7z7: Error (exit code 2)
Logs of the wait container in the Pod generated by the workflow:
❯ kubectl logs -n argo hello-world-jw7z7 -c wait
fatal: bad g in signal handler
fatal: bad g in signal handler
@dacamposol thanks for your response. Are you using the emissary executor or another one?
Thank you for checking the issue so fast.
I'm using emissary.
Can you see if this happens on :latest?
Hi @dacamposol in your Pod status that you show above, are you sure that's the Workflow Pod? Or is it the Workflow Controller Pod? Your image is: argoproj/workflow-controller:v3.3.8-linux-amd64 but that image should only be for the Controller, not for a Workflow Pod.
Do you mean in the initial message? Yes, I took it from the status of the Pod spammed from the corresponding Workflow.
Can you see if this happens on :latest?
Do you mean with the version 3.4.0?
We have some breaking changes there, so I cannot safely test it right away. Give me a day and I'll take a look.
Do you mean with the version 3.4.0?
no. I mean the :latest image tag
I tend to wonder if things were configured strangely. You shouldn't be seeing a Workflow Pod using a workflow-controller image. You should have 3 containers: "main" using the image in your spec, "wait" using the quay.io/argoproj/argoexec image, and an init container "init" also using quay.io/argoproj/argoexec.
That was my bad writing the OP. Since we pass the images through a proxy registry in order to scan them and use them within our intranet, I just replaced its name with the public one, and it seems that I replaced it wrongly. The workflow Pod is indeed using the executor image in wait and init.
The real snippet from the initial message (concealing the GCR project) is the following:
- containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
  image: eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor:3.3.8
  imageID: eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor@sha256:1ecc5b305fb784271797c14436c30b4b6d23204d603d416fdb0382f62604325f
  lastState: {}
  name: wait
  ready: false
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://d10fcac735c905dc65717148e04437e32148bea560bb66e9ed35d718471f1bf8
      exitCode: 2
      finishedAt: "2022-08-10T07:06:08Z"
      reason: Error
      startedAt: "2022-08-10T07:06:07Z"
Sorry for the inconvenience 😅
PS: Again, I mention that this only happens sometimes; the majority of the time the workflows run without any issue, but about one out of twelve times there is that weird fatal: bad g in signal handler error. Since we submit thousands of Workflows per day, it was more noticeable in our case.
no. I mean the :latest image tag
It doesn't do anything for any Workflow. It doesn't show any event on the Workflow resource, and the logs of the workflow-controller Pod show the following error:
time="2022-09-09T15:45:06.300Z" level=info msg=healthz age=5m0s err="workflow never reconciled: workflow-template-hello-world-2gqkh" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
I just replaced the image of the related containers to use :latest and also removed the containerRuntimeExecutor key from the config, as it seems it wasn't recognised in that version (3.3.9, I assume?).
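Roughly, the change looked like the sketch below (Deployment and container names are assumptions based on a default install and may differ; the executor image itself is set via the --executor-image argument on the same Deployment):

# Point the controller Deployment at the :latest image
# (deployment/container names are assumptions from a default install).
kubectl -n argo set image deployment/argo-workflow-controller \
  workflow-controller=quay.io/argoproj/workflow-controller:latest-linux-amd64

# Update the --executor-image argument on the same Deployment, then
# remove the containerRuntimeExecutor key from the ConfigMap.
kubectl -n argo edit deployment argo-workflow-controller
kubectl -n argo edit configmap argo-workflow-controller-configmap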
@alexec just for including more info:
The moment I downgraded again to the previous image I had (:v3.3.8-linux-amd64), all the stagnant Workflows (20) started. Result after all of them ran: 17 succeeded and 3 failed with the initially mentioned error.
With the :latest-linux-amd64 tag, only the behaviour from the previous comment could be seen.
I see you're using your own copy of our Docker image. I wonder if there is any potential that that somehow got corrupted. Are you able to try removing "--executor-image" from your workflow controller configuration and try using the standard image?
Out of curiosity, which version of Argo Workflows were you using before, and were you using the emissary executor or a different one?
Based on somebody else who got the same "bad g in signal handler" error here (not Argo Workflows) I am wondering if it's something particular to your environment. You're running in Google Cloud, right? Using Linux AMD 64 machines? Anything else to clarify?
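If it helps, a quick way to check whether that flag is set could look like this (a sketch; the Deployment name is an assumption based on the controller Pod name above):

# Print the controller's arguments and look for --executor-image.
kubectl -n argo get deployment argo-workflow-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'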
I see you're using your own copy of our Docker image. I wonder if there is any potential that that somehow got corrupted. Are you able to try removing "--executor-image" from your workflow controller configuration and try using the standard image?
I don't know what :latest is pointing to, but for the :v3.3.8 tag I compared the digests of the images and they're the same. Any corruption would have been reflected there.
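For reference, roughly how such a comparison can be done (a sketch; the upstream reference argoproj/argoexec:v3.3.8 is an assumption about which public image the proxy mirrors):

# Compare the manifests (and therefore the digests) of the mirrored
# copy and the public image; identical output means identical digests.
docker manifest inspect argoproj/argoexec:v3.3.8 > upstream.json
docker manifest inspect \
  eu.gcr.io/XXXXXXX/com.dacamposol/dockerhub/argo-workflow-executor:3.3.8 > mirror.json
diff upstream.json mirror.json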
Based on somebody else who got the same "bad g in signal handler" error https://github.com/loft-sh/devspace/issues/1334 (not Argo Workflows) I am wondering if it's something particular to your environment. You're running in Google Cloud, right? Using Linux AMD 64 machines? Anything else to clarify?
The bad g in signal handler message is actually an error from the Go runtime's signal handling code (signal_unix), so there are probably different combinations of people hitting this issue with different tools.
Concretely, the tests mentioned in this thread run in a cluster in GCP, in the europe-west3 region. The Kubernetes version of the cluster is 1.23.8 and the machines are Linux AMD64 machines.
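In case it's useful, the node details can be confirmed with something like:

# Show OS image, kernel version and architecture for every node
# (values reported above: Linux, amd64, Kubernetes 1.23.8).
kubectl get nodes -o wide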
Out of curiosity, which version of Argo Workflows were you using before, and were you using the emissary executor or a different one?
We were using an old version, v2.11.7, and the container executor.
Thanks for the information.
Can you provide the main container log together with the wait container and Controller logs (full logs)? Also, would you mind providing your workflow-controller-config?
@alexec just for including more info:
The moment I downgraded again to the previous image I had (:v3.3.8-linux-amd64), all the stagnant Workflows (20) started. Result after all of them ran: 17 succeeded and 3 failed with the initially mentioned error. With the :latest-linux-amd64 tag, only the behaviour from the previous comment could be seen.
Sorry, can you clarify what you mean by "only the behaviour from the previous comment could be seen"? Which previous comment? (Trying to figure out if something has been resolved between 3.3.8 and latest.)
Can you provide the main container log together with the wait container and Controller logs (full logs)? Also, would you mind providing your workflow-controller-config?
Sure. They fail in the init container, so there is not much in the logs:
fatal: bad g in signal handler
fatal: bad g in signal handler
Stream closed EOF for argo/hello-world-8gbgf-560498845 (init)
The main and wait containers are stuck in PodInitializing.
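(For completeness, roughly how those logs were pulled; the Pod name is the one from the line above:)

# Init container logs and events for one of the stuck Pods.
kubectl logs -n argo hello-world-8gbgf-560498845 -c init
kubectl describe pod -n argo hello-world-8gbgf-560498845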
In the Controller, it only shows this after a few minutes:
time="2022-09-14T06:55:19.001Z" level=info msg="Deleting garbage collected workflow 'argo/hello-world-8gbgf
time="2022-09-14T06:55:19.028Z" level=info msg="Successfully deleted 'argo/hello-world-8gbgf'"
Also, the ConfigMap of the controller, as requested:
Name:         argo-workflow-controller-configmap
Namespace:    argo
Labels:       app.kubernetes.io/component=workflow-controller
              app.kubernetes.io/instance=dxp-argo-987lx-dxp-dev
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=argo-cm
              app.kubernetes.io/part-of=argo-workflows
              helm.sh/chart=argo-1662030514699
Annotations:  <none>

Data
====
config:
----
containerRuntimeExecutor: emissary
artifactRepository:
  gcs:
    bucket: dxp-argo-artifacts
    keyFormat: '{{workflow.namespace}}/{{workflow.name}}/'
    serviceAccountKeySecret:
      key: serviceAccountKey
      name: argo-artifacts-gcs
Sorry, can you clarify what you mean by "only the behaviour from the previous comment could be seen"? Which previous comment? (Trying to figure out if something has been resolved between 3.3.8 and latest.)
I mean that it doesn't do anything; the workflows just don't run.
I refer to this comment:
It doesn't do anything for any Workflow. It doesn't show any event on the Workflow resource and the logs of the workflow-controller Pod show the following error...
There is no point testing on v3.3. There is no fix on that branch, nor is there likely to be. You need to test with :latest tags.
https://www.howtogeek.com/devops/understanding-dockers-latest-tag/
@alexec I think he did test with the :latest-linux-amd64 tag and as far as I understand, the behavior is the same
Please correct me if I'm wrong.
@dacamposol Thanks for providing the additional information.
There is no point testing on v3.3. There is no fix on that branch, nor is there likely to be. You need to test with :latest tags.
I tested it with :latest-linux-amd64 and the Workflows didn't start at all.
I assume it's due to some breaking change between 3.3.x and latest concerning our concrete workflows.
@juliev0 you're welcome :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
I ran 1000 workflows and still was not able to reproduce this issue, unfortunately.
@FabianKramm I did some investigation on my end, so downgrading docker for mac to v3.1.0 works fine. It seems issue is related qemu upgrades on latest docker for mac v3.2.0.
Perhaps this is related to the environment you run argo-workflows on rather than to argo-workflows itself?
@isubasinghe it was a problem with the 3.3.x version.
I don't know what changed, but once I upgraded to 3.4.x, the error stopped happening.
BTW, I'm closing this, as the upgrade to the latest version solved the weird issue.
Summary
We have updated Argo Workflows to version 3.3.8 and lately we have noticed that some of the Workflows are failing due to their Pods being terminated.
When I analysed the Pod, I noticed the following:
Also, when I check the logs of the aforementioned wait container, I don't see anything other than the following two lines:
fatal: bad g in signal handler
fatal: bad g in signal handler
It's not a consistent error, and I have checked that it happens around 1 in 12 times, but it's weird and I don't see a way to fix it.
What version are you running?
I'm running Argo Workflows 3.3.8, deployed by Helm in our cluster.
Diagnostics
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.