Closed tomaszstachera closed 1 month ago
I've tried with hello-world workflow and issue is the same:
apiVersion: argoproj.io/v1alpha1
kind: Workflow # new type of k8s spec
metadata:
generateName: hello-world- # name of the workflow spec
spec:
entrypoint: hello-world # invoke the hello-world template
templates:
- name: hello-world # name of the template
container:
image: busybox
command: [ echo ]
args: [ "hello world" ]
resources: # limit the resources
limits:
memory: 32Mi
cpu: 100m
tolerations:
- effect: NoSchedule
key: ComputeResources
value: reservedFor
Logs:
workflow-controller time="2024-08-27T13:51:18.919Z" level=info msg="Processing workflow" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Get configmaps 404"
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Updated phase -> Running" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Pod node hello-world-2jk42 initialized Pending" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: failed to resolve {{`hello-world-2jk42`}}"
workflow-controller time="2024-08-27T13:51:18.924Z" level=error msg="Mark error node" error="failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz nodeName=hello-world-2jk42 workflow=
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 phase Pending -> Error" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 message: failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 finished: 2024-08-27 13:51:18.924977442 +0000 UTC" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=error msg="error in entry template execution" error="failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workflow=hello-wor
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: failed to resolve {{`hello-world-2jk42`}}"
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Updated message -> error in entry template execution: failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workfl
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Workflow to be dehydrated" Workflow Size=1254
workflow-controller time="2024-08-27T13:51:18.930Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/hello-world-2jk42-1340600742-agent/deletePod
workflow-controller time="2024-08-27T13:51:18.936Z" level=info msg="Queueing Error workflow tomasz/hello-world-2jk42 for delete in 168h0m0s due to TTL"
workflow-controller time="2024-08-27T13:51:18.936Z" level=info msg="Delete pods 404"
workflow-controller time="2024-08-27T13:51:18.938Z" level=info msg="Update workflows 200"
workflow-controller time="2024-08-27T13:51:18.938Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724983576 workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.939Z" level=info msg="Create events 201"
workflow-controller time="2024-08-27T13:51:18.943Z" level=info msg="DeleteCollection workflowtaskresults 200"
My core version is 3.3.8, but I've also tried with the one below.
3.3.8 is outdated and unsupported. KFP recently added support for Argo 3.4.x in https://github.com/kubeflow/pipelines/pull/10568, which is supported
image: gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance
This is not an Argo image, that is a Kubeflow fork, so Argo cannot help you with that.
Currently every pipeline/workflow ends with above error.
We have the same version on other environments and it works there.
If it works in a different environment, that sounds like an environment issue and not an Argo bug. Every Workflow failing in one environment but not another sounds very much like an environment issue.
I've tried with hello-world workflow and issue is the same:
Similarly, Argo runs many tests in CI and has many users; the hello-world workflow certainly works, so this sounds like an environment issue as well. To be explicit, I cannot reproduce that.
- --executor-image - quay.io/argoproj/workflow-controller:latest
The Controller is not an Executor, so that is an incorrect configuration and possibly the source of your errors. I've also never seen an error with the format "{{`ppln-from-vsc-xkhhr`}}" -- the backticks in a template seem invalid -- so this sounds like some very unexpected configuration was given to Argo, which would correctly result in an error.
You also did not provide your Controller ConfigMap, which could also have bugs in it.
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Get configmaps 404" workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
You also have some other issues popping up in your logs that are indicative of misconfigurations.
Pre-requisites
:latest
image tag (i.e.quay.io/argoproj/workflow-controller:latest
) and can confirm the issue still exists on:latest
. If not, I have explained why, in detail, in my description below.What happened? What did you expect to happen?
Cannot run any workflow via Kubeflow Pipeline. Every attemp ends with
Non-transient error: failed to resolve <name>
.Currently every pipeline/workflow ends with above error. My core version is 3.3.8, but I've also tried with the one below. We have the same version on other environments and it works there.
Workflow-controller Pod core definition:
Workflow-controller Pod logs for given pipeline:
Sample Kubeflow pipeline:
Version(s)
v3.3.8. v3.5.10
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container