Closed louisnow closed 2 years ago
@louisnow can you provide the `kubectl describe` output of the pending Pods?
Describe output of the pending pod, @sarabala1979:
Name: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344
Namespace: default
Priority: 0
Node: <none>
Labels: workflows.argoproj.io/completed=false
workflows.argoproj.io/workflow=dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
Annotations: workflows.argoproj.io/node-id: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344
workflows.argoproj.io/node-name: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140[0].whalesay-before-process
Status: Pending
IP:
IPs: <none>
Controlled By: Workflow/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
Init Containers:
init:
Image: ghcr.io/atlanhq/argoexec:v3.2.9
Port: <none>
Host Port: <none>
Command:
argoexec
init
--loglevel
info
Environment:
ARGO_POD_NAME: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344 (v1:metadata.name)
ARGO_CONTAINER_RUNTIME_EXECUTOR: emissary
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
ARGO_CONTAINER_NAME: init
ARGO_TEMPLATE: {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2022-04-13T05:09:00Z
Mounts:
/argo/staging from argo-staging (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
Containers:
wait:
Image: ghcr.io/atlanhq/argoexec:v3.2.9
Port: <none>
Host Port: <none>
Command:
argoexec
wait
--loglevel
info
Environment:
ARGO_POD_NAME: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344 (v1:metadata.name)
ARGO_CONTAINER_RUNTIME_EXECUTOR: emissary
GODEBUG: x509ignoreCN=0
ARGO_WORKFLOW_NAME: dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140
ARGO_CONTAINER_NAME: wait
ARGO_TEMPLATE: {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2022-04-13T05:09:00Z
Mounts:
/mainctrfs/argo/staging from argo-staging (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
main:
Image: python:alpine
Port: <none>
Host Port: <none>
Command:
/var/run/argo/argoexec
emissary
--
python
Args:
/argo/staging/script
Environment:
ARGO_CONTAINER_NAME: main
ARGO_TEMPLATE: {"name":"retry-two","inputs":{"parameters":[{"name":"message","value":"HELLO ALL"}]},"outputs":{},"metadata":{},"script":{"name":"","image":"python:alpine","command":["python"],"resources":{},"imagePullPolicy":"IfNotPresent","source":"import time\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\nlogging.debug(\"Sleeping for 30 sec\")\ntime.sleep(30)\nmsg = 'HELLO ALL'\nlogging.debug(msg)\n"},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"s3.ap-south-1.amazonaws.com","bucket":"atlan-vcluster-louis-w333xv88y910","region":"ap-south-1","insecure":false,"useSDKCreds":true,"key":"argo-artifacts/default/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140/dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-2063582344"}}}
ARGO_INCLUDE_SCRIPT_OUTPUT: false
ARGO_DEADLINE: 2022-04-13T05:09:00Z
Mounts:
/argo/staging from argo-staging (rw)
/var/run/argo from var-run-argo (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-llltl (ro)
Volumes:
var-run-argo:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
argo-staging:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-llltl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning SyncError 39s (x14 over 80s) pod-syncer Error syncing to physical cluster: Pod "dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron--72ea418fd1" is invalid: spec.hostname: Invalid value: "dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
@louisnow it looks like your hostname string is invalid because it ends with `-`.
Kubernetes limits a hostname to 63 characters (an RFC 1123 DNS label), and `dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140-`
is exactly 63 characters, so the trailing `-` suggests the name was truncated at the limit. Do you have a mutating webhook that updates the pod hostname?
Can you try a shorter CronWorkflow name?
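The failure mode described above can be sketched in a few lines of Python. The 63-character limit and the validation regex are taken verbatim from the `SyncError` event; the truncation step is an assumption for illustration (something in the sync path appears to cut the pod name at the limit, leaving a trailing `-`):

```python
import re

# RFC 1123 label rules, as quoted in the SyncError message above.
RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LEN = 63  # Kubernetes hostname (DNS label) length limit

def is_valid_hostname(name: str) -> bool:
    return len(name) <= MAX_LABEL_LEN and RFC1123_LABEL.fullmatch(name) is not None

# Pod name from the describe output: a 62-char workflow name plus "-" and the node ID.
pod_name = ("dsq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649740140"
            "-2063582344")

# Hypothetical truncation to the 63-char limit, as a syncer might do.
# The cut lands exactly on the "-" separator, producing an invalid label.
hostname = pod_name[:MAX_LABEL_LEN]
print(hostname)                      # ends with '-'
print(is_valid_hostname(hostname))   # False

# Trimming trailing '-' after truncating would yield a valid label.
print(is_valid_hostname(hostname.rstrip("-")))  # True
```

This also explains why a shorter CronWorkflow name avoids the bug: with fewer characters before the appended timestamp and node ID, the 63rd character of the pod name is an alphanumeric rather than a `-`.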
Hmm, we use https://loft.sh for our k8s cluster. Thanks for the pointer!
Checklist
* [x] Double-checked my configuration.
* [ ] Tested using the latest version.
* [x] Used the Emissary executor.

## Summary

What happened / what you expected to happen?

- Scenario: a Cron workflow references a WorkflowTemplate, and the WorkflowTemplate's entrypoint template references another template.
- The pod is stuck in the Pending state when referencing another template, due to the invalid `spec.hostname` bug.
- It does not happen if the WorkflowTemplate's entrypoint points directly to the template with the python container/code.
- It also does not happen when submitting directly via the submit button in the cron workflow UI.
- If I change the cron workflow name to something <= 50 characters, it always works.

This behaviour feels like a corner case in handling the naming conventions of the pod/workflow. The cron workflow example below has a name of exactly 51 characters.

What version are you running?

v3.2.9, Emissary executor

## Diagnostics

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

Workflow Template. The template below will work if you change the entrypoint to `entrypoint: retry-two`; the only difference is that we're referencing another template.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: retry
spec:
  entrypoint: retry-one
  templates:
    - name: retry-one
      dag:
        tasks:
          - name: whalesay-before-process
            template: retry-two
            arguments:
              parameters:
                - name: message
                  value: "HELLO ALL"
    - name: retry-two
      inputs:
        parameters:
          - name: message
            value: "HELLO ALL"
      script:
        command: [ python ]
        image: python:alpine
        imagePullPolicy: IfNotPresent
        source: |
          import time
          import logging
          logging.basicConfig(level=logging.DEBUG)
          logging.debug("Sleeping for 30 sec")
          time.sleep(30)
          msg = '{{inputs.parameters.message}}'
          logging.debug(msg)
```

Cron Workflow template running every minute. This bug only occurs when the cron workflow schedules the workflow, not when it is submitted manually via the UI.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron
spec:
  schedule: "* * * * *"
  workflowSpec:
    workflowTemplateRef:
      name: retry
```

Logs

```bash
Events:
  Type     Reason     Age                 From        Message
  ----     ------     ----                ----        -------
  Warning  SyncError  23s (x14 over 64s)  pod-syncer  Error syncing to physical cluster: Pod "asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron--cae34340bf" is invalid: spec.hostname: Invalid value: "asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649531820-": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
```

```bash
# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
time="2022-04-09T19:25:00.068Z" level=info msg="Processing workflow" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.087Z" level=info msg="Updated phase  -> Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="Steps node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300 initialized Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="StepGroup node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 initialized Running" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.101Z" level=info msg="Pod node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-1673603954 initialized Pending" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="Created pod: asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300[0].whalesay-before-process (asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-1673603954)" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="Workflow step group node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 not yet completed" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.112Z" level=info msg=reconcileAgentPod namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:00.135Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=15371561 workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.068Z" level=info msg="Processing workflow" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg="Workflow step group node asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300-3862662396 not yet completed" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
time="2022-04-09T19:25:10.070Z" level=info msg=reconcileAgentPod namespace=default workflow=asq-94ee6512-9439-4c3e-a57c-958845e541b4-cbabf-cron-1649532300
```

Images

- First one: succeeded with the entrypoint as the python container.
- Second one: stuck on Pending with the error above.
- Third one: same code as the second one, but manually run via the submit button from the cron workflow UI.

---

Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.