argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Kubeflow Pipelines: `Non-transient error: failed to resolve {{`ppln-from-vsc-xkhhr`}}` #13512

Closed tomaszstachera closed 1 month ago

tomaszstachera commented 1 month ago

Pre-requisites

What happened? What did you expect to happen?

Cannot run any workflow via Kubeflow Pipelines. Every attempt ends with `Non-transient error: failed to resolve <name>`.

Currently every pipeline/workflow ends with the above error. My core version is 3.3.8, but I've also tried with the image shown below. We have the same version on other environments and it works there.

Workflow-controller Pod core definition:

spec:
  containers:
  - args:
    - --configmap
    - workflow-controller-configmap
    - --executor-image
    - quay.io/argoproj/workflow-controller:latest
    command:
    - workflow-controller
    env:
    - name: LEADER_ELECTION_IDENTITY
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance

Workflow-controller Pod logs for given pipeline:

workflow-controller time="2024-08-27T12:38:03.175Z" level=info msg="Processing workflow" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="Task-result reconciliation" namespace=tomasz numObjs=0 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.176Z" level=info msg="All of node ppln-from-vsc-xkhhr.sample-op dependencies [] completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.177Z" level=info msg="Pod node ppln-from-vsc-xkhhr-1425665423 initialized Pending" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=warning msg="Non-transient error: failed to resolve {{`ppln-from-vsc-xkhhr`}}"
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz nodeName=ppln-from-vsc-xkhhr.sample-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 phase Pending -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 finished: 2024-08-27 12:38:03.178501556 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=error msg="Mark error node" error="task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz nodeName=ppln-from-vsc-xkhhr.sample-op workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr-1425665423 message: task 'ppln-from-vsc-xkhhr.sample-op' errored: failed to resolve {{`ppln-from-vsc-xkhhr`}}" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Skipped node ppln-from-vsc-xkhhr-184939484 initialized Omitted (message: omitted: depends condition not met)" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Outbound nodes of ppln-from-vsc-xkhhr set to [ppln-from-vsc-xkhhr-184939484]" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="node ppln-from-vsc-xkhhr finished: 2024-08-27 12:38:03.178728796 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of ppln-from-vsc-xkhhr" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="TaskSet Reconciliation" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg=reconcileAgentPod namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.178Z" level=info msg="Workflow to be dehydrated" Workflow Size=9953
workflow-controller time="2024-08-27T12:38:03.184Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/ppln-from-vsc-xkhhr-1340600742-agent/deletePod
workflow-controller time="2024-08-27T12:38:03.188Z" level=info msg="Update workflows 200"
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724918529 workflow=ppln-from-vsc-xkhhr
workflow-controller time="2024-08-27T12:38:03.189Z" level=info msg="Queueing Error workflow tomasz/ppln-from-vsc-xkhhr for delete in 168h0m0s due to TTL"
workflow-controller time="2024-08-27T12:38:03.195Z" level=info msg="Delete pods 404"
workflow-controller time="2024-08-27T12:38:03.196Z" level=info msg="DeleteCollection workflowtaskresults 200"
workflow-controller time="2024-08-27T12:38:03.197Z" level=info msg="Patch events 200"

Sample Kubeflow pipeline:

import kfp
from kubernetes.client.models.v1_toleration import V1Toleration

# to discard effect of taint on a node
toleration = V1Toleration(effect='NoSchedule', key='ComputeResources', value='reservedFor')

def sample_op(in_var: int) -> int:
    print(in_var)
    return in_var

sample_comp = kfp.components.func_to_container_op(
    func=sample_op,
    base_image='python:3.10-slim-buster',
)
@kfp.dsl.pipeline(
    name='ppln-from-vsc',
    description='A pipeline'
)
def ppln_from_vsc():
    ret = sample_comp(1234).set_memory_request('25Mi').set_memory_limit('100Mi').set_cpu_request('25m').set_cpu_limit('50m').add_toleration(toleration)
    sample_comp(ret.output).set_memory_request('25Mi').set_memory_limit('100Mi').set_cpu_request('25m').set_cpu_limit('50m').add_toleration(toleration)

client = kfp.Client()
resp = client.create_run_from_pipeline_func(
    ppln_from_vsc,
    arguments={},
    # namespace='tomasz'
)
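For context on the wiring above: `ret.output` compiles into an Argo template reference (`{{tasks.sample-op.outputs.parameters.sample-op-Output}}`, visible in the manifest below) that the controller substitutes at runtime. A rough Python sketch of that kind of substitution, for illustration only (`resolve` is a hypothetical helper, not Argo's actual Go resolver):

```python
import re

# Illustrative sketch only -- `resolve` is a hypothetical helper, not the
# controller's real template-resolution code.
def resolve(template: str, scope: dict) -> str:
    def repl(match: re.Match) -> str:
        key = match.group(1).strip()
        if key not in scope:
            # Mirrors the shape of the controller's error message
            raise ValueError("failed to resolve {{%s}}" % key)
        return str(scope[key])
    return re.sub(r"\{\{([^{}]+)\}\}", repl, template)

# A known output parameter resolves cleanly:
scope = {"tasks.sample-op.outputs.parameters.sample-op-Output": "1234"}
print(resolve("--in-var {{tasks.sample-op.outputs.parameters.sample-op-Output}}", scope))
# → --in-var 1234

# An unknown (or malformed, e.g. backticked) reference fails, much like the log error:
try:
    resolve("{{`ppln-from-vsc-xkhhr`}}", {})
except ValueError as e:
    print(e)  # → failed to resolve {{`ppln-from-vsc-xkhhr`}}
```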

Version(s)

v3.3.8, v3.5.10

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    pipelines.kubeflow.org/kfp_sdk_version: 1.8.21
    pipelines.kubeflow.org/pipeline_compilation_time: 2024-08-27T13:20:30.555863
    pipelines.kubeflow.org/pipeline_spec: '{"description": "A pipeline", "name": "ppln-from-vsc"}'
    pipelines.kubeflow.org/run_name: ppln_from_vsc 2024-08-27 13-20-30
    workflows.argoproj.io/pod-name-format: v1
  creationTimestamp: "2024-08-27T13:20:30Z"
  generateName: ppln-from-vsc-
  generation: 2
  labels:
    pipeline/persistedFinalState: "true"
    pipeline/runid: e374b471-bdf7-4daa-8f1f-fec917727743
    pipelines.kubeflow.org/kfp_sdk_version: 1.8.21
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/phase: Error
  name: ppln-from-vsc-hfncg
  namespace: tomasz
  resourceVersion: "724956422"
  uid: 06d3b608-d659-41c2-859b-74e7e80f64f5
spec:
  arguments: {}
  entrypoint: ppln-from-vsc
  podMetadata:
    labels:
      pipeline/runid: e374b471-bdf7-4daa-8f1f-fec917727743
  serviceAccountName: default-editor
  templates:
  - dag:
      tasks:
      - arguments: {}
        name: sample-op
        template: sample-op
      - arguments:
          parameters:
          - name: sample-op-Output
            value: '{{tasks.sample-op.outputs.parameters.sample-op-Output}}'
        dependencies:
        - sample-op
        name: sample-op-2
        template: sample-op-2
    inputs: {}
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
    name: ppln-from-vsc
    outputs: {}
  - container:
      args:
      - --in-var
      - "1234"
      - '----output-paths'
      - /tmp/outputs/Output/data
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def sample_op(in_var):
            print(in_var)
            return in_var

        def _serialize_int(int_value: int) -> str:
            if isinstance(int_value, str):
                return int_value
            if not isinstance(int_value, int):
                raise TypeError('Value "{}" has type "{}" instead of int.'.format(
                    str(int_value), str(type(int_value))))
            return str(int_value)

        import argparse
        _parser = argparse.ArgumentParser(prog='Sample op', description='')
        _parser.add_argument("--in-var", dest="in_var", type=int, required=True, default=argparse.SUPPRESS)
        _parser.add_argument("----output-paths", dest="_output_paths", type=str, nargs=1)
        _parsed_args = vars(_parser.parse_args())
        _output_files = _parsed_args.pop("_output_paths", [])

        _outputs = sample_op(**_parsed_args)

        _outputs = [_outputs]

        _output_serializers = [
            _serialize_int,

        ]

        import os
        for idx, output_file in enumerate(_output_files):
            try:
                os.makedirs(os.path.dirname(output_file))
            except OSError:
                pass
            with open(output_file, 'w') as f:
                f.write(_output_serializers[idx](_outputs[idx]))
      image: python:3.10-slim-buster
      name: ""
      resources:
        limits:
          cpu: 50m
          memory: 100Mi
        requests:
          cpu: 25m
          memory: 25Mi
    inputs: {}
    metadata:
      annotations:
        pipelines.kubeflow.org/arguments.parameters: '{"in_var": "1234"}'
        pipelines.kubeflow.org/component_ref: '{}'
        pipelines.kubeflow.org/component_spec: '{"implementation": {"container": {"args":
          ["--in-var", {"inputValue": "in_var"}, "----output-paths", {"outputPath":
          "Output"}], "command": ["sh", "-ec", "program_path=$(mktemp)\nprintf \"%s\"
          \"$0\" > \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n", "def
          sample_op(in_var):\n    print(in_var)\n    return in_var\n\ndef _serialize_int(int_value:
          int) -> str:\n    if isinstance(int_value, str):\n        return int_value\n    if
          not isinstance(int_value, int):\n        raise TypeError(''Value \"{}\"
          has type \"{}\" instead of int.''.format(\n            str(int_value), str(type(int_value))))\n    return
          str(int_value)\n\nimport argparse\n_parser = argparse.ArgumentParser(prog=''Sample
          op'', description='''')\n_parser.add_argument(\"--in-var\", dest=\"in_var\",
          type=int, required=True, default=argparse.SUPPRESS)\n_parser.add_argument(\"----output-paths\",
          dest=\"_output_paths\", type=str, nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files
          = _parsed_args.pop(\"_output_paths\", [])\n\n_outputs = sample_op(**_parsed_args)\n\n_outputs
          = [_outputs]\n\n_output_serializers = [\n    _serialize_int,\n\n]\n\nimport
          os\nfor idx, output_file in enumerate(_output_files):\n    try:\n        os.makedirs(os.path.dirname(output_file))\n    except
          OSError:\n        pass\n    with open(output_file, ''w'') as f:\n        f.write(_output_serializers[idx](_outputs[idx]))\n"],
          "image": "python:3.10-slim-buster"}}, "inputs": [{"name": "in_var", "type":
          "Integer"}], "name": "Sample op", "outputs": [{"name": "Output", "type":
          "Integer"}]}'
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
        pipelines.kubeflow.org/enable_caching: "true"
        pipelines.kubeflow.org/kfp_sdk_version: 1.8.21
        pipelines.kubeflow.org/pipeline-sdk-type: kfp
    name: sample-op
    outputs:
      artifacts:
      - name: sample-op-Output
        path: /tmp/outputs/Output/data
      parameters:
      - name: sample-op-Output
        valueFrom:
          path: /tmp/outputs/Output/data
    tolerations:
    - effect: NoSchedule
      key: ComputeResources
      value: reservedFor
  - container:
      args:
      - --in-var
      - '{{inputs.parameters.sample-op-Output}}'
      - '----output-paths'
      - /tmp/outputs/Output/data
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def sample_op(in_var):
            print(in_var)
            return in_var

        def _serialize_int(int_value: int) -> str:
            if isinstance(int_value, str):
                return int_value
            if not isinstance(int_value, int):
                raise TypeError('Value "{}" has type "{}" instead of int.'.format(
                    str(int_value), str(type(int_value))))
            return str(int_value)

        import argparse
        _parser = argparse.ArgumentParser(prog='Sample op', description='')
        _parser.add_argument("--in-var", dest="in_var", type=int, required=True, default=argparse.SUPPRESS)
        _parser.add_argument("----output-paths", dest="_output_paths", type=str, nargs=1)
        _parsed_args = vars(_parser.parse_args())
        _output_files = _parsed_args.pop("_output_paths", [])

        _outputs = sample_op(**_parsed_args)

        _outputs = [_outputs]

        _output_serializers = [
            _serialize_int,

        ]

        import os
        for idx, output_file in enumerate(_output_files):
            try:
                os.makedirs(os.path.dirname(output_file))
            except OSError:
                pass
            with open(output_file, 'w') as f:
                f.write(_output_serializers[idx](_outputs[idx]))
      image: python:3.10-slim-buster
      name: ""
      resources:
        limits:
          cpu: 50m
          memory: 100Mi
        requests:
          cpu: 25m
          memory: 25Mi
    inputs:
      parameters:
      - name: sample-op-Output
    metadata:
      annotations:
        pipelines.kubeflow.org/arguments.parameters: '{"in_var": "{{inputs.parameters.sample-op-Output}}"}'
        pipelines.kubeflow.org/component_ref: '{}'
        pipelines.kubeflow.org/component_spec: '{"implementation": {"container": {"args":
          ["--in-var", {"inputValue": "in_var"}, "----output-paths", {"outputPath":
          "Output"}], "command": ["sh", "-ec", "program_path=$(mktemp)\nprintf \"%s\"
          \"$0\" > \"$program_path\"\npython3 -u \"$program_path\" \"$@\"\n", "def
          sample_op(in_var):\n    print(in_var)\n    return in_var\n\ndef _serialize_int(int_value:
          int) -> str:\n    if isinstance(int_value, str):\n        return int_value\n    if
          not isinstance(int_value, int):\n        raise TypeError(''Value \"{}\"
          has type \"{}\" instead of int.''.format(\n            str(int_value), str(type(int_value))))\n    return
          str(int_value)\n\nimport argparse\n_parser = argparse.ArgumentParser(prog=''Sample
          op'', description='''')\n_parser.add_argument(\"--in-var\", dest=\"in_var\",
          type=int, required=True, default=argparse.SUPPRESS)\n_parser.add_argument(\"----output-paths\",
          dest=\"_output_paths\", type=str, nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files
          = _parsed_args.pop(\"_output_paths\", [])\n\n_outputs = sample_op(**_parsed_args)\n\n_outputs
          = [_outputs]\n\n_output_serializers = [\n    _serialize_int,\n\n]\n\nimport
          os\nfor idx, output_file in enumerate(_output_files):\n    try:\n        os.makedirs(os.path.dirname(output_file))\n    except
          OSError:\n        pass\n    with open(output_file, ''w'') as f:\n        f.write(_output_serializers[idx](_outputs[idx]))\n"],
          "image": "python:3.10-slim-buster"}}, "inputs": [{"name": "in_var", "type":
          "Integer"}], "name": "Sample op", "outputs": [{"name": "Output", "type":
          "Integer"}]}'
        sidecar.istio.io/inject: "false"
      labels:
        pipelines.kubeflow.org/cache_enabled: "true"
        pipelines.kubeflow.org/enable_caching: "true"
        pipelines.kubeflow.org/kfp_sdk_version: 1.8.21
        pipelines.kubeflow.org/pipeline-sdk-type: kfp
    name: sample-op-2
    outputs:
      artifacts:
      - name: sample-op-2-Output
        path: /tmp/outputs/Output/data
    tolerations:
    - effect: NoSchedule
      key: ComputeResources
      value: reservedFor
  ttlStrategy:
    secondsAfterCompletion: 604800

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

time="2024-08-27T13:20:30.875Z" level=info msg="Processing workflow" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.883Z" level=info msg="Updated phase  -> Running" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.884Z" level=info msg="DAG node ppln-from-vsc-hfncg initialized Running" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.884Z" level=info msg="All of node ppln-from-vsc-hfncg.sample-op dependencies [] completed" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.884Z" level=info msg="Pod node ppln-from-vsc-hfncg-3546699776 initialized Pending" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=warning msg="Non-transient error: failed to resolve {{`ppln-from-vsc-hfncg`}}"
time="2024-08-27T13:20:30.886Z" level=error msg="Mark error node" error="failed to resolve {{`ppln-from-vsc-hfncg`}}" namespace=tomasz nodeName=ppln-from-vsc-hfncg.sample-op workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg-3546699776 phase Pending -> Error" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg-3546699776 message: failed to resolve {{`ppln-from-vsc-hfncg`}}" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg-3546699776 finished: 2024-08-27 13:20:30.88643039 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=error msg="Mark error node" error="task 'ppln-from-vsc-hfncg.sample-op' errored: failed to resolve {{`ppln-from-vsc-hfncg`}}" namespace=tomasz nodeName=ppln-from-vsc-hfncg.sample-op workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg-3546699776 message: task 'ppln-from-vsc-hfncg.sample-op' errored: failed to resolve {{`ppln-from-vsc-hfncg`}}" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Skipped node ppln-from-vsc-hfncg-3881415295 initialized Omitted (message: omitted: depends condition not met)" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Outbound nodes of ppln-from-vsc-hfncg set to [ppln-from-vsc-hfncg-3881415295]" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="node ppln-from-vsc-hfncg finished: 2024-08-27 13:20:30.886740515 +0000 UTC" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Checking daemoned children of ppln-from-vsc-hfncg" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="TaskSet Reconciliation" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg=reconcileAgentPod namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.886Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.892Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/ppln-from-vsc-hfncg-1340600742-agent/deletePod
time="2024-08-27T13:20:30.897Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724956404 workflow=ppln-from-vsc-hfncg
time="2024-08-27T13:20:30.898Z" level=info msg="Queueing Error workflow tomasz/ppln-from-vsc-hfncg for delete in 168h0m0s due to TTL"
time="2024-08-27T13:20:31.910Z" level=info msg="Queueing Error workflow tomasz/ppln-from-vsc-hfncg for delete in 167h59m59s due to TTL"

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

The workflow did not start, so there are no logs.
tomaszstachera commented 1 month ago

I've tried with the hello-world workflow and the issue is the same:

apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: hello-world       # invoke the hello-world template
  templates:
    - name: hello-world         # name of the template
      container:
        image: busybox
        command: [ echo ]
        args: [ "hello world" ]
        resources: # limit the resources
          limits:
            memory: 32Mi
            cpu: 100m
      tolerations:
      - effect: NoSchedule
        key: ComputeResources
        value: reservedFor

Logs:

workflow-controller time="2024-08-27T13:51:18.919Z" level=info msg="Processing workflow" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Get configmaps 404"
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Updated phase  -> Running" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Pod node hello-world-2jk42 initialized Pending" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: failed to resolve {{`hello-world-2jk42`}}"
workflow-controller time="2024-08-27T13:51:18.924Z" level=error msg="Mark error node" error="failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz nodeName=hello-world-2jk42 workflow=
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 phase Pending -> Error" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 message: failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="node hello-world-2jk42 finished: 2024-08-27 13:51:18.924977442 +0000 UTC" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.924Z" level=error msg="error in entry template execution" error="failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workflow=hello-wor
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: failed to resolve {{`hello-world-2jk42`}}"
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Updated phase Running -> Error" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Updated message  -> error in entry template execution: failed to resolve {{`hello-world-2jk42`}}" namespace=tomasz workfl
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Marking workflow completed" namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Checking daemoned children of " namespace=tomasz workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.925Z" level=info msg="Workflow to be dehydrated" Workflow Size=1254
workflow-controller time="2024-08-27T13:51:18.930Z" level=info msg="cleaning up pod" action=deletePod key=tomasz/hello-world-2jk42-1340600742-agent/deletePod
workflow-controller time="2024-08-27T13:51:18.936Z" level=info msg="Queueing Error workflow tomasz/hello-world-2jk42 for delete in 168h0m0s due to TTL"
workflow-controller time="2024-08-27T13:51:18.936Z" level=info msg="Delete pods 404"
workflow-controller time="2024-08-27T13:51:18.938Z" level=info msg="Update workflows 200"
workflow-controller time="2024-08-27T13:51:18.938Z" level=info msg="Workflow update successful" namespace=tomasz phase=Error resourceVersion=724983576 workflow=hello-world-2jk42
workflow-controller time="2024-08-27T13:51:18.939Z" level=info msg="Create events 201"
workflow-controller time="2024-08-27T13:51:18.943Z" level=info msg="DeleteCollection workflowtaskresults 200"
agilgur5 commented 1 month ago

My core version is 3.3.8, but I've also tried with the one below.

3.3.8 is outdated and unsupported. KFP recently added support for Argo 3.4.x in https://github.com/kubeflow/pipelines/pull/10568, which is supported

    image: gcr.io/ml-pipeline/workflow-controller:v3.3.10-license-compliance

This is not an Argo image; it is a Kubeflow fork, so Argo cannot help you with it.

Currently every pipeline/workflow ends with above error.

We have the same version on other environments and it works there.

If it works in a different environment, that sounds like an environment issue rather than an Argo bug; every Workflow failing in one environment but not another points strongly at the environment.

I've tried with hello-world workflow and issue is the same:

Similarly, Argo runs many tests in CI and has many users; the hello-world workflow certainly works, so this also sounds like an environment issue. To be explicit, I cannot reproduce it.

    - --executor-image
    - quay.io/argoproj/workflow-controller:latest

The Controller is not an Executor, so that is an incorrect configuration and possibly the source of your errors. I've also never seen an error of the format "{{`ppln-from-vsc-xkhhr`}}" -- the backticks in a template look invalid -- so it seems some very unexpected configuration was given to Argo, which would correctly result in an error.
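For comparison, a corrected set of controller args would point `--executor-image` at an `argoexec` image rather than the controller image. The exact tag below is an assumption and should match your deployed controller version:

```yaml
spec:
  containers:
  - args:
    - --configmap
    - workflow-controller-configmap
    - --executor-image
    - quay.io/argoproj/argoexec:v3.5.10   # executor image, not the controller image (tag assumed)
    command:
    - workflow-controller
```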

You also did not provide your Controller ConfigMap, which could also have bugs in it.

workflow-controller time="2024-08-27T13:51:18.924Z" level=info msg="Get configmaps 404"
workflow-controller time="2024-08-27T13:51:18.924Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"

You also have some other issues popping up in your logs that are indicative of misconfigurations.