Workflow-controller was unable to obtain node (running on local computer / KinD)

julienteisseire commented 3 weeks ago

Pre-requisites

[X] I have double-checked my configuration
[X] I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
[X] I have searched existing issues and could not find a match for this bug
[X] I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Hi,

I submit my workflow with few HTTP calls.

I have the following error level=error msg="was unable to obtain node for.

Regarding workflow-controller logs, I see the following logs :

time="2024-10-31T13:01:27.783Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:27.784Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:37.802Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:37.803Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.867Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:57.917Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:57.918Z" level=debug msg="unresolved is allowed " error=unresolved

I guess it causes lag and overtime consuming (50 seconds workflow duration for 2 HTTP calls).

Name:                rinex-ingester-lb5wn
Namespace:           commanding
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Thu Oct 31 14:01:17 +0100 (50 seconds ago)
Started:             Thu Oct 31 14:01:17 +0100 (50 seconds ago)
Finished:            Thu Oct 31 14:02:07 +0100 (now)
Duration:            50 seconds
Progress:            2/2

STEP                     TEMPLATE  PODNAME  DURATION  MESSAGE
 ✔ rinex-ingester-lb5wn  main
 ├───✔ rinex-probe       http-get
 └───✔ rinex-start       http-put

Even if I see correct output, I would like to understand why this node error while using only one node in KinD.

I can understand agent POD is not existing and has to be created first, but why so many was unable to obtain node logs ? And why not to maintain the agent alive in order to avoid creating it at workflow submit ?

Thank you

Version(s)

v3.5.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Example of HTTP template workflow (mine is using private image so that cannot be used to reproduced, but same problem with both).

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: http-template-
  namespace: commanding
  labels:
    workflows.argoproj.io/test: "true"
  annotations:
    workflows.argoproj.io/description: |
      Http template will demostrate http template functionality
    workflows.argoproj.io/version: '>= 3.2.0'
spec:
  entrypoint: main
  serviceAccountName: argo-workflow
  templates:
    - name: main
      steps:
        - - name: good
            template: http
            arguments:
              parameters: [{name: url, value: "https://raw.githubusercontent.com/argoproj/argo-workflows/4e450e250168e6b4d51a126b784e90b11a0162bc/pkg/apis/workflow/v1alpha1/generated.swagger.json"}]
          - name: bad
            template: http
            continueOn:
              failed: true
            arguments:
              parameters: [{name: url, value: "https://raw.githubusercontent.com/argoproj/argo-workflows/thisisnotahash/pkg/apis/workflow/v1alpha1/generated.swagger.json"}]

    - name: http
      inputs:
        parameters:
          - name: url
      http:
       # url: http://dummy.restapiexample.com/api/v1/employees
       url: "{{inputs.parameters.url}}"

Logs from the workflow controller

time="2024-10-31T13:01:47.866Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2073 namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.866Z" level=info msg="Task-result reconciliation" namespace=commanding numObjs=0 workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.866Z" level=debug msg="Skipping artifact GC" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.866Z" level=info msg=updateAgentPodStatus namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.866Z" level=info msg=assessAgentPodStatus namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent
time="2024-10-31T13:01:47.867Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.867Z" level=debug msg="Evaluating node rinex-ingester-lb5wn: template: *v1alpha1.WorkflowStep (main), boundaryID: , stack depth: 0" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.867Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:47.868Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:47.868Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:47.870Z" level=debug msg="Executing node rinex-ingester-lb5wn of Steps is Running" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.871Z" level=debug msg="Evaluating node rinex-ingester-lb5wn[0].rinex-probe: template: *v1alpha1.WorkflowStep (http-get), boundaryID: rinex-ingester-lb5wn, stack depth: 1" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.871Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-get)"
time="2024-10-31T13:01:47.871Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-get)"
time="2024-10-31T13:01:47.871Z" level=debug msg="Getting the template by name: http-get" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-get)"
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.872Z" level=debug msg="Node rinex-ingester-lb5wn[0].rinex-probe already completed" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="Step group node rinex-ingester-lb5wn-2342179342 successful" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="node rinex-ingester-lb5wn-2342179342 phase Running -> Succeeded" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="node rinex-ingester-lb5wn-2342179342 finished: 2024-10-31 13:01:47.872205423 +0000 UTC" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=debug msg="Initializing node rinex-ingester-lb5wn[1]: template: *v1alpha1.WorkflowStep invalid (https://argo-workflows.readthedocs.io/en/release-3.5/templates/), boundaryID: rinex-ingester-lb5wn" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="StepGroup node rinex-ingester-lb5wn-1201154155 initialized Running" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="SG Outbound nodes of rinex-ingester-lb5wn-2967083535 are [rinex-ingester-lb5wn-2967083535]" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=debug msg="Evaluating node rinex-ingester-lb5wn[1].rinex-start: template: *v1alpha1.WorkflowStep (http-put), boundaryID: rinex-ingester-lb5wn, stack depth: 1" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:47.872Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:47.872Z" level=debug msg="Getting the template by name: http-put" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.872Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:47.872Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-10-31T13:01:47.872Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-10-31T13:01:47.872Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.NodeStatus (main)"
time="2024-10-31T13:01:47.872Z" level=debug msg="Initializing node rinex-ingester-lb5wn[1].rinex-start: template: *v1alpha1.WorkflowStep (http-put), boundaryID: rinex-ingester-lb5wn" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="HTTP node rinex-ingester-lb5wn-1420861584 initialized Pending" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.872Z" level=info msg="Workflow step group node rinex-ingester-lb5wn-1201154155 not yet completed" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.873Z" level=info msg="TaskSet Reconciliation" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.873Z" level=info msg="Creating TaskSet" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.873Z" level=debug msg="creating new taskset" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.907Z" level=debug msg="Create workflowtasksets 409"
time="2024-10-31T13:01:47.907Z" level=debug msg="patching the exiting taskset" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.914Z" level=debug msg="Patch workflowtasksets 200"
time="2024-10-31T13:01:47.914Z" level=info msg=reconcileAgentPod namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.914Z" level=debug msg="Skipped pod creation: already exists" namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent podPhase=Running workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.914Z" level=info msg=updateAgentPodStatus namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:47.914Z" level=info msg=assessAgentPodStatus namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent
time="2024-10-31T13:01:47.914Z" level=debug msg="Log changes patch: {\"status\":{\"nodes\":{\"rinex-ingester-lb5wn-1201154155\":{\"boundaryID\":\"rinex-ingester-lb5wn\",\"children\":[\"rinex-ingester-lb5wn-1420861584\"],\"displayName\":\"[1]\",\"finishedAt\":null,\"id\":\"rinex-ingester-lb5wn-1201154155\",\"name\":\"rinex-ingester-lb5wn[1]\",\"nodeFlag\":{},\"phase\":\"Running\",\"startedAt\":\"2024-10-31T13:01:47Z\",\"templateScope\":\"local/rinex-ingester-lb5wn\",\"type\":\"StepGroup\"},\"rinex-ingester-lb5wn-1420861584\":{\"boundaryID\":\"rinex-ingester-lb5wn\",\"displayName\":\"rinex-start\",\"finishedAt\":null,\"id\":\"rinex-ingester-lb5wn-1420861584\",\"inputs\":{\"parameters\":[{\"name\":\"url\",\"value\":\"http://commanding-proxy.commanding.svc.cluster.local:7070/rinex/ingester/start\"},{\"name\":\"request\",\"value\":\"{\\n  \\\"startRequeststring\\\": \\\"Start Request\\\"\\n}\\n\"}]},\"name\":\"rinex-ingester-lb5wn[1].rinex-start\",\"phase\":\"Pending\",\"startedAt\":\"2024-10-31T13:01:47Z\",\"templateName\":\"http-put\",\"templateScope\":\"local/rinex-ingester-lb5wn\",\"type\":\"HTTP\"},\"rinex-ingester-lb5wn-2342179342\":{\"finishedAt\":\"2024-10-31T13:01:47Z\",\"phase\":\"Succeeded\"},\"rinex-ingester-lb5wn-2967083535\":{\"children\":[\"rinex-ingester-lb5wn-1201154155\"]}}}}"
time="2024-10-31T13:01:47.915Z" level=info msg="Workflow to be dehydrated" Workflow Size=3179
time="2024-10-31T13:01:47.920Z" level=debug msg="Patch workflowtasksets/status 200"
time="2024-10-31T13:01:47.924Z" level=debug msg="Patch workflowtasksets 200"
time="2024-10-31T13:01:47.930Z" level=debug msg="Update workflows 200"
time="2024-10-31T13:01:47.931Z" level=info msg="Workflow update successful" namespace=commanding phase=Running resourceVersion=2089 workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:56.372Z" level=debug msg="Syncing all CronWorkflows"
time="2024-10-31T13:01:57.916Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2089 namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.916Z" level=info msg="Task-result reconciliation" namespace=commanding numObjs=0 workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.916Z" level=debug msg="Skipping artifact GC" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.916Z" level=info msg=updateAgentPodStatus namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.917Z" level=info msg=assessAgentPodStatus namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent
time="2024-10-31T13:01:57.917Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.917Z" level=debug msg="Evaluating node rinex-ingester-lb5wn: template: *v1alpha1.WorkflowStep (main), boundaryID: , stack depth: 0" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.917Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:57.917Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:57.917Z" level=debug msg="Getting the template by name: main" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (main)"
time="2024-10-31T13:01:57.917Z" level=debug msg="Executing node rinex-ingester-lb5wn of Steps is Running" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=debug msg="Step group node &NodeStatus{ID:rinex-ingester-lb5wn-2342179342,Name:rinex-ingester-lb5wn[0],DisplayName:[0],Type:StepGroup,TemplateName:,TemplateRef:nil,Phase:Succeeded,BoundaryID:rinex-ingester-lb5wn,Message:,StartedAt:2024-10-31 13:01:17 +0000 UTC,FinishedAt:2024-10-31 13:01:47 +0000 UTC,PodIP:,Daemoned:nil,Inputs:nil,Outputs:nil,Children:[rinex-ingester-lb5wn-2967083535],OutboundNodes:[],TemplateScope:local/rinex-ingester-lb5wn,ResourcesDuration:ResourcesDuration{},HostNodeName:,MemoizationStatus:nil,EstimatedDuration:0,SynchronizationStatus:nil,Progress:1/2,NodeFlag:&NodeFlag{Hooked:false,Retried:false,},} already marked completed" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=info msg="SG Outbound nodes of rinex-ingester-lb5wn-2967083535 are [rinex-ingester-lb5wn-2967083535]" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=debug msg="Evaluating node rinex-ingester-lb5wn[1].rinex-start: template: *v1alpha1.WorkflowStep (http-put), boundaryID: rinex-ingester-lb5wn, stack depth: 1" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:57.918Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:57.918Z" level=debug msg="Getting the template by name: http-put" base="*v1alpha1.Workflow (namespace=commanding,name=rinex-ingester-lb5wn)" tmpl="*v1alpha1.WorkflowStep (http-put)"
time="2024-10-31T13:01:57.918Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:57.918Z" level=debug msg="unresolved is allowed " error=unresolved
time="2024-10-31T13:01:57.918Z" level=debug msg="Executing node rinex-ingester-lb5wn[1].rinex-start of HTTP is Pending" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=info msg="Workflow step group node rinex-ingester-lb5wn-1201154155 not yet completed" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=info msg="TaskSet Reconciliation" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=info msg="Creating TaskSet" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.918Z" level=debug msg="creating new taskset" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.930Z" level=debug msg="Create workflowtasksets 409"
time="2024-10-31T13:01:57.930Z" level=debug msg="patching the exiting taskset" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.936Z" level=debug msg="Patch workflowtasksets 200"
time="2024-10-31T13:01:57.936Z" level=info msg=reconcileAgentPod namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.936Z" level=debug msg="Skipped pod creation: already exists" namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent podPhase=Running workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.936Z" level=info msg=updateAgentPodStatus namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:01:57.936Z" level=info msg=assessAgentPodStatus namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent
time="2024-10-31T13:01:57.937Z" level=debug msg="Log changes patch: {\"status\":{\"nodes\":{\"rinex-ingester-lb5wn-1420861584\":{\"finishedAt\":\"2024-10-31T13:01:57Z\",\"outputs\":{\"result\":\"OK Start Request\\n\"},\"phase\":\"Succeeded\"}}}}"
time="2024-10-31T13:01:57.937Z" level=info msg="Workflow to be dehydrated" Workflow Size=3210
time="2024-10-31T13:01:57.942Z" level=debug msg="Patch workflowtasksets/status 200"
time="2024-10-31T13:01:57.950Z" level=debug msg="Patch workflowtasksets 200"
time="2024-10-31T13:01:57.956Z" level=debug msg="Update workflows 200"
time="2024-10-31T13:01:57.956Z" level=info msg="Workflow update successful" namespace=commanding phase=Running resourceVersion=2106 workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:06.372Z" level=debug msg="Syncing all CronWorkflows"
time="2024-10-31T13:02:07.944Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=2106 namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:07.944Z" level=info msg="Task-result reconciliation" namespace=commanding numObjs=0 workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:07.944Z" level=debug msg="Skipping artifact GC" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:07.944Z" level=info msg=updateAgentPodStatus namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:07.944Z" level=info msg=assessAgentPodStatus namespace=commanding podName=rinex-ingester-lb5wn-1340600742-agent
time="2024-10-31T13:02:07.944Z" level=error msg="was unable to obtain node for rinex-ingester-lb5wn-2166136261" namespace=commanding workflow=rinex-ingester-lb5wn
time="2024-10-31T13:02:07.944Z" level=debug msg="Evaluating node rinex-ingester-lb5wn: template: *v1alpha1.WorkflowStep (main), boundaryID: , stack depth: 0" namespace=commanding workflow=rinex-ingester-lb5wn

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

julienteisseire commented 3 weeks ago

I tried with 6 HTTP calls, it is now 2min11s workflow duration !

Duration:            2 minutes 11 seconds
Progress:            6/6

STEP                     TEMPLATE  PODNAME  DURATION  MESSAGE
 ✔ rinex-ingester-68whf  main
 ├───✔ rinex-probe-1     http-get
 ├───✔ rinex-start-1     http-put
 ├───✔ rinex-probe-2     http-get
 ├───✔ rinex-start-2     http-put
 ├───✔ rinex-probe-3     http-get
 └───✔ rinex-start-3     http-put

Despite the time to create the agent at the begining : rinex-ingester-z2mvm-1340600742-agent 1/1 Running

And the fact that it is well the same agent which is used during all the workflow execution.

I see this logs :

time="2024-10-31T13:35:06.019Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:35:16.039Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:35:26.069Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:35:36.088Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:35:46.117Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:35:56.133Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:06.157Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:16.175Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:26.201Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:36.223Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:46.251Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:36:56.272Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf
time="2024-10-31T13:37:06.306Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf

Joibel commented 3 weeks ago

This error message refers to the Argo Workflows usage of the word node (meaning a node in the DAG) rather than the kubernetes usage of the word node.

Apart from the error message in the logs is something not working? I know from your other issues you think workflows is being too slow for your workflow, but I'm trying more about this specific github issue.

julienteisseire commented 3 weeks ago

Thank you for your answer.

I don't understand what is wrong in my workflow description. Why this message about node not obtained ?

It is a basic multi steps workflow (not a DAG) with 6 HTTP calls as follows :

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: rinex-ingester-
  namespace: commanding
  labels:
    workflows.argoproj.io/test: "false"
  annotations:
    workflows.argoproj.io/description: |
      First SWO workflow using HTTP template
    workflows.argoproj.io/version: '>= 3.2.0'
spec:
  entrypoint: main
  serviceAccountName: argo-workflow
  templates:
    - name: main
      steps:
        - - name: rinex-probe-1
            template: http-get
            arguments:
              parameters: [{name: url, value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/probe"}]
        - - name: rinex-start-1
            template: http-put
            arguments:
              parameters: 
              - name: url
                value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/start"
              - name: request
                value: |
                  {
                    "startRequeststring": "Start Request"
                  }
        - - name: rinex-probe-2
            template: http-get
            arguments:
              parameters: [{name: url, value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/probe"}]
        - - name: rinex-start-2
            template: http-put
            arguments:
              parameters: 
              - name: url
                value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/start"
              - name: request
                value: |
                  {
                    "startRequeststring": "Start Request"
                  }
        - - name: rinex-probe-3
            template: http-get
            arguments:
              parameters: [{name: url, value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/probe"}]
        - - name: rinex-start-3
            template: http-put
            arguments:
              parameters: 
              - name: url
                value: "http://xxxxxxxxxxxxxxxxx/rinex/ingester/start"
              - name: request
                value: |
                  {
                    "startRequeststring": "Start Request"
                  }
    - name: http-get
      inputs:
        parameters:
          - name: url
      http:
       method: "GET"
       url: "{{inputs.parameters.url}}"
    - name: http-put
      inputs:
        parameters:
          - name: url
          - name: request
      http:
       method: "PUT"
       url: '{{inputs.parameters.url}}'
       body: '{{inputs.parameters.request}}'

So I don't see what node is about in my workflow and how I can fix this error ... ? Indeed, I am also trying to understand why it took 2m11s to start agent POD and then run 6 HTTP calls ... Do you consider it is a normal execution time ? I am looking for information regarding why I have so many 10s waiting phases

time="2024-10-31T14:16:06.829Z" level=debug msg="Syncing all CronWorkflows"
time="2024-10-31T14:16:16.830Z" level=debug msg="Syncing all CronWorkflows"

or all other sleep in the workflow.

So I have really 2 things to deal with :

1. error about node not obtained

time="2024-10-31T13:37:06.306Z" level=error msg="was unable to obtain node for rinex-ingester-68whf-2166136261" namespace=commanding workflow=rinex-ingester-68whf

2. workflow submit duration

Name:                rinex-ingester-68whf
Namespace:           commanding
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Thu Oct 31 14:34:55 +0100 (2 minutes ago)
Started:             Thu Oct 31 14:34:55 +0100 (2 minutes ago)
Finished:            Thu Oct 31 14:37:06 +0100 (now)
Duration:            2 minutes 11 seconds
Progress:            6/6

STEP                     TEMPLATE  PODNAME  DURATION  MESSAGE
 ✔ rinex-ingester-68whf  main
 ├───✔ rinex-probe-1     http-get
 ├───✔ rinex-start-1     http-put
 ├───✔ rinex-probe-2     http-get
 ├───✔ rinex-start-2     http-put
 ├───✔ rinex-probe-3     http-get
 └───✔ rinex-start-3     http-put

If you may help. Thank you

jswxstw commented 3 weeks ago

Duplicated with #12726. I think what you're concerned about is not this error log, but rather the excessive execution time of the workflow, which you can optimize it by adjusting the controller envs: ARGO_AGENT_PATCH_RATE or DEFAULT_REQUEUE_TIME.

julienteisseire commented 2 weeks ago

Indeed performance was my main concern. Thank you.

But I also consider having an error is not normal and it would be great to understand why this error and how to fix it.

Error message : "level=error msg="was unable to obtain node for ..." Workflow : 6 HTTP calls in 6 consecutive steps. Kind with 1 node in local computer.

Thank you.

tooptoop4 commented 1 week ago

i think https://argo-workflows.readthedocs.io/en/latest/http-template/ should be deprecated/removed, having many tiny steps/pods is inefficient. if u just use a simple workflow with single pod like https://github.com/argoproj/argo-workflows/blob/main/examples/retry-container.yaml#L16 and put multiple curl/python requests calls in there it will work quick

argoproj / argo-workflows