argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Non-transient error: <nil> #13881

Open black-snow opened 1 week ago

black-snow commented 1 week ago

What happened? What did you expect to happen?

I encounter a lot of these:

Non-transient error: <nil>

and sadly there's not much context I can give. Actually, I can give close to nil context. ;)

We run workflows with up to 250 nodes and thousands of concurrent pods. The above log is apparently just a warning, but the nil there seems worrisome. Should I worry?

Sadly, I cannot provide proper steps to reproduce. This ain't very helpful but perhaps someone else is seeing the same or has an immediate idea.

Version(s)

v3.5.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

time="2024-11-08T10:02:19.639Z" level=warning msg="Non-transient error: <nil>"

Logs from your workflow's wait container

N/A

tooptoop4 commented 1 week ago

@black-snow provide the controller logs surrounding that line

isubasinghe commented 1 week ago

This is not really an issue; there is a bug where we call IsTransientError on a nil error.

black-snow commented 1 week ago

Thanks for looking into this. Do you still need the surrounding logs? Can we get rid of them if they are false positives?

sarabala1979 commented 1 week ago

@black-snow Would you like to contribute? It is a good first issue.

tooptoop4 commented 1 week ago

@black-snow the surrounding logs can help to know which line to change

black-snow commented 1 week ago

@sarabala1979 absolutely. Sounds like @isubasinghe already has a notion of where to look.

MasonM commented 2 days ago

@black-snow No need for the logs. I can reproduce this locally and entered https://github.com/argoproj/argo-workflows/pull/13917 with a fix.