argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Cron Workflow pod stuck in pending state if referencing secondary template with error invalid spec.hostname #8352

Closed: louisnow closed this issue 2 years ago

louisnow commented 2 years ago

Checklist

* [x] Double-checked my configuration.
* [ ] Tested using the latest version.
* [x] Used the Emissary executor.

## Summary

What happened / what you expected to happen?

- Scenario: a CronWorkflow references a WorkflowTemplate whose entrypoint template references another template.
- The pod gets stuck in the Pending state with an invalid `spec.hostname` error when the entrypoint references another template.
- This does not happen if the WorkflowTemplate's entrypoint points directly at the template containing the Python container/code.
- It also does not happen when submitting directly via the Submit button in the CronWorkflow UI.
- If I change the CronWorkflow name to something 50 characters or shorter, it always works.

This behaviour feels like a corner case in how pod/workflow names are handled. The CronWorkflow example below has a name that is exactly 51 characters long.

What version are you running? v3.2.9, Emissary executor.

## Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

WorkflowTemplate. The template below works if you change the entrypoint to `entrypoint: retry-two`; the only difference is that `retry-one` references another template.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: retry
spec:
  entrypoint: retry-one
  templates:
    - name: retry-one
      dag:
        tasks:
          - name: whalesay-before-process
            template: retry-two
            arguments:
              parameters:
                - name: message
                  value: "HELLO ALL"
    - name: retry-two
      inputs:
        parameters:
          - name: message
            value: "HELLO ALL"
      script:
        command: [ python ]
        image: python:alpine
        imagePullPolicy: IfNotPresent
        source: |
          import time
          import logging
          logging.basicConfig(level=logging.DEBUG)
          logging.debug("Sleeping for 30 sec")
          time.sleep(30)
          msg = '{{inputs.parameters.message}}'
          logging.debug(msg)
```

CronWorkflow running every minute. This bug only occurs when the CronWorkflow schedules the workflow, not when it is submitted manually via the UI.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron
spec:
  schedule: "* * * * *"
  workflowSpec:
    workflowTemplateRef:
      name: retry
```

Logs

```bash
Events:
  Type     Reason     Age                 From        Message
  ----     ------     ----                ----        -------
  Warning  SyncError  23s (x14 over 64s)  pod-syncer  Error syncing to physical cluster: Pod "asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron--cae34340bf" is invalid: spec.hostname: Invalid value: "asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649531820-": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
```

```bash
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2022-04-09T19:25:00.068Z" level=info msg="Processing workflow" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.087Z" level=info msg="Updated phase -> Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="Steps node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300 initialized Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="StepGroup node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 initialized Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="Pod node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-1673603954 initialized Pending" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="Created pod: asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300[0].whalesay-before-process (asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-1673603954)" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="Workflow step group node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 not yet completed" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg=reconcileAgentPod namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.135Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=15371561 workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.068Z" level=info msg="Processing workflow" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg="Workflow step group node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 not yet completed" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg=reconcileAgentPod namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
```

Images (screenshots attached to the original issue):

- First: succeeded with the entrypoint pointing directly at the Python container.
- Second: stuck in Pending with the error above.
- Third: same code as the second, but run manually via the Submit button in the CronWorkflow UI.

---
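The 51-character threshold lines up with simple name-length arithmetic (a sketch; the timestamp and node-id hash below are taken from the logs above, 63 is the RFC 1123 DNS-label maximum, and the exact way `spec.hostname` is derived by truncation is an assumption, not confirmed Argo behaviour):

```python
# Hypothetical illustration of the name-length arithmetic; not Argo code.
cron_name = "asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron"
assert len(cron_name) == 51

# The controller appends "-<unix timestamp>" (11 chars) to form the workflow name:
workflow_name = cron_name + "-1649531820"
assert len(workflow_name) == 62  # one char below the 63-char DNS label limit

# Any further suffix (here the pod's node-id hash from the logs) pushes a
# name-derived hostname past 63 chars; truncating leaves a trailing '-':
hostname = (workflow_name + "-1673603954")[:63]
print(hostname)  # ends with '-', so it is not a valid RFC 1123 label
```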

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

sarabala1979 commented 2 years ago

@louisnow can you provide the `kubectl describe` output of the pending pod?

louisnow commented 2 years ago

`kubectl describe` output of the pending pod, @sarabala1979:

Name:           dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344
Namespace:      default
Priority:       0
Node:           <none>
Labels:         workflows.argoproj.io/completed=false
                workflows.argoproj.io/workflow=dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
Annotations:    workflows.argoproj.io/node-id: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344
                workflows.argoproj.io/node-name: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140[0].whalesay-before-process
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Workflow/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
Init Containers:
  init:
    Image:      ghcr.io/atlanhq/argoexec:v3.2.9
    Port:       <none>
    Host Port:  <none>
    Command:
      argoexec
      init
      --loglevel
      info
    Environment:
      ARGO_POD_NAME:                    dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344 (v1:metadata.name)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:  emissary
      GODEBUG:                          x509ignoreCN=0
      ARGO_WORKFLOW_NAME:               dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
      ARGO_CONTAINER_NAME:              init
      ARGO_TEMPLATE:                    {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
      ARGO_INCLUDE_SCRIPT_OUTPUT:       false
      ARGO_DEADLINE:                    2022-04-13T05:09:00Z
    Mounts:
      /argo/staging from argo-staging (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
Containers:
  wait:
    Image:      ghcr.io/atlanhq/argoexec:v3.2.9
    Port:       <none>
    Host Port:  <none>
    Command:
      argoexec
      wait
      --loglevel
      info
    Environment:
      ARGO_POD_NAME:                    dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344 (v1:metadata.name)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:  emissary
      GODEBUG:                          x509ignoreCN=0
      ARGO_WORKFLOW_NAME:               dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
      ARGO_CONTAINER_NAME:              wait
      ARGO_TEMPLATE:                    {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
      ARGO_INCLUDE_SCRIPT_OUTPUT:       false
      ARGO_DEADLINE:                    2022-04-13T05:09:00Z
    Mounts:
      /mainctrfs/argo/staging from argo-staging (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
  main:
    Image:      python:alpine
    Port:       <none>
    Host Port:  <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --
      python
    Args:
      /argo/staging/script
    Environment:
      ARGO_CONTAINER_NAME:         main
      ARGO_TEMPLATE:               {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
      ARGO_INCLUDE_SCRIPT_OUTPUT:  false
      ARGO_DEADLINE:               2022-04-13T05:09:00Z
    Mounts:
      /argo/staging from argo-staging (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  argo-staging:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-llltl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From        Message
  ----     ------     ----                ----        -------
  Warning  SyncError  39s (x14 over 80s)  pod-syncer  Error syncing to physical cluster: Pod "dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron--72ea418fd1" is invalid: spec.hostname: Invalid value: "dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
sarabala1979 commented 2 years ago

@louisnow it looks like your hostname string is invalid because it ends with `-`. Hostnames are limited to 63 characters, and `dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-` is exactly 63 characters. Do you have a mutating webhook that updates the pod hostname?

Can you try a shorter CronWorkflow name?
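The failure mode in the SyncError event can be reproduced against the RFC 1123 label regex quoted in the error message (a minimal sketch; `naive_truncate` and `safe_truncate` are hypothetical helpers for illustration, not Argo or loft.sh code):

```python
import re

# RFC 1123 label regex, quoted verbatim in the SyncError event above.
RFC1123_LABEL = re.compile(r'^[a-z0-9]([-a-z0-9]*[a-z0-9])?$')

def naive_truncate(name: str, limit: int = 63) -> str:
    """Plain truncation: may leave a trailing '-', which fails validation."""
    return name[:limit]

def safe_truncate(name: str, limit: int = 63) -> str:
    """Truncate, then strip trailing '-' so the label stays valid."""
    return name[:limit].rstrip('-')

pod_name = "dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"
bad = naive_truncate(pod_name)
good = safe_truncate(pod_name)
print(bad, bool(RFC1123_LABEL.match(bad)))    # trailing '-': invalid label
print(good, bool(RFC1123_LABEL.match(good)))  # valid label
```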

louisnow commented 2 years ago

Hmm, we use https://loft.sh for our k8s cluster. Thanks for the pointer!