argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15.11k stars · 3.21k forks

Pod `StartError` seems to be ignored #4011

Closed · alexec closed this 4 years ago

alexec commented 4 years ago

Summary

The pod failed to start, so the workflow should have errored, but it remained Running instead.

Diagnostics

What version of Argo Workflows are you running? master

    loops-sequence-vvb9d-4095976472:
      id: loops-sequence-vvb9d-4095976472
      name: 'loops-sequence-vvb9d[0].sequence-count(3:3)'
      displayName: 'sequence-count(3:3)'
      type: Pod
      templateName: echo
      templateScope: namespaced/loops-sequence
      phase: Running
      boundaryID: loops-sequence-vvb9d
      startedAt: '2020-09-13T21:15:01Z'
      finishedAt: null
      estimatedDuration: 19000000000
      inputs:
        parameters:
          - name: msg
            value: '3'
      hostNodeName: k3d-k3s-default-server
apiVersion: v1
kind: Pod
metadata:
  annotations:
    workflows.argoproj.io/node-name: loops-sequence-vvb9d[0].sequence-count(3:3)
    workflows.argoproj.io/template: '{"name":"echo","arguments":{},"inputs":{"parameters":[{"name":"msg","value":"3"}]},"outputs":{},"metadata":{},"container":{"name":"","image":"alpine:latest","command":["echo","3"],"resources":{}},"archiveLocation":{"archiveLogs":true,"s3":{"endpoint":"minio:9000","bucket":"my-bucket","insecure":true,"accessKeySecret":{"name":"my-minio-cred","key":"accesskey"},"secretKeySecret":{"name":"my-minio-cred","key":"secretkey"},"key":"loops-sequence-vvb9d/loops-sequence-vvb9d-4095976472"}}}'
  labels:
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/workflow: loops-sequence-vvb9d
  name: loops-sequence-vvb9d-4095976472
  namespace: argo
  ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: Workflow
      name: loops-sequence-vvb9d
      uid: 0d676e02-8145-4b5c-a3a6-dcadb76a7841
spec:
  containers:
    - command:
        - argoexec
        - wait
      env:
        - name: ARGO_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: ARGO_CONTAINER_RUNTIME_EXECUTOR
          value: pns
      image: argoproj/argoexec:latest
      imagePullPolicy: IfNotPresent
      name: wait
      resources:
        limits:
          cpu: 500m
          memory: 128Mi
        requests:
          cpu: 100m
          memory: 64Mi
      securityContext:
        capabilities:
          add:
            - SYS_PTRACE
            - SYS_CHROOT
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /argo/podmetadata
          name: podmetadata
        - mountPath: /argo/secret/my-minio-cred
          name: my-minio-cred
          readOnly: true
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
    - command:
        - echo
        - "3"
      image: alpine:latest
      imagePullPolicy: Always
      name: main
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: default-token-tf5qr
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: k3d-k3s-default-server
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  shareProcessNamespace: true
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  volumes:
    - downwardAPI:
        defaultMode: 420
        items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.annotations
            path: annotations
      name: podmetadata
    - name: my-minio-cred
      secret:
        defaultMode: 420
        items:
          - key: accesskey
            path: accesskey
          - key: secretkey
            path: secretkey
        secretName: my-minio-cred
    - name: default-token-tf5qr
      secret:
        defaultMode: 420
        secretName: default-token-tf5qr
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      message: 'containers with unready status: [main]'
      reason: ContainersNotReady
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-09-13T21:15:01Z"
      status: "True"
      type: PodScheduled
  containerStatuses:
    - containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
      image: docker.io/library/alpine:latest
      imageID: docker.io/library/alpine@sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
      lastState: {}
      name: main
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: containerd://8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76
          exitCode: 128
          finishedAt: "2020-09-13T21:15:45Z"
          message: 'failed to create containerd task: failed to start io pipe copy:
            unable to copy pipes: containerd-shim: opening w/o fifo "/run/k3s/containerd/io.containerd.grpc.v1.cri/containers/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76/io/236879711/8d2d58a13a2a9147281fe0b157796b001d7ea02248f1f503eebfb46257bf0f76-stdout"
            failed: context deadline exceeded'
          reason: StartError
          startedAt: "1970-01-01T00:00:00Z"
    - containerID: containerd://35a12933d32e2150e5c62940f19a7b9cc57b1fbcdc11a343fe2f0f7b306694d0
      image: docker.io/argoproj/argoexec:latest
      imageID: sha256:76c472387dfe8d5cb8126b494dbe90ae9c59b9389db2761bf75db0a2d60cfbae
      lastState: {}
      name: wait
      ready: true
      restartCount: 0
      started: true
      state:
        running:
          startedAt: "2020-09-13T21:15:06Z"
  hostIP: 172.18.0.2
  phase: Running
  podIP: 10.42.0.204
  podIPs:
    - ip: 10.42.0.204
  qosClass: Burstable
  startTime: "2020-09-13T21:15:01Z"
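The status above shows the crux of the bug: the `main` container is in a `terminated` state with `reason: StartError` and `exitCode: 128`, yet the pod-level `phase` is still `Running` because the `wait` sidecar is healthy. A controller that only inspects `pod.Status.Phase` therefore never notices the failure. A minimal, self-contained sketch of walking the container statuses instead (the `containerFailure` helper and the simplified stand-in types are mine, modelling only the relevant fields of `k8s.io/api/core/v1`):

```go
package main

import "fmt"

// Simplified stand-ins for the k8s.io/api/core/v1 types involved;
// only the fields relevant to this bug are modelled.
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
	Running    bool
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	Phase                 string
	InitContainerStatuses []ContainerStatus
	ContainerStatuses     []ContainerStatus
}

// containerFailure reports whether any container has terminated with a
// non-zero exit code, even while the pod-level phase is still Running.
func containerFailure(st PodStatus) (bool, string) {
	for _, s := range append(st.InitContainerStatuses, st.ContainerStatuses...) {
		if t := s.State.Terminated; t != nil && t.ExitCode > 0 {
			return true, fmt.Sprintf("container %q failed: %s (exit code %d)",
				s.Name, t.Reason, t.ExitCode)
		}
	}
	return false, ""
}

func main() {
	// The relevant slice of the pod status from the dump above:
	// phase Running, wait running, main terminated with StartError/128.
	st := PodStatus{
		Phase: "Running",
		ContainerStatuses: []ContainerStatus{
			{Name: "main", State: ContainerState{
				Terminated: &ContainerStateTerminated{ExitCode: 128, Reason: "StartError"}}},
			{Name: "wait", State: ContainerState{Running: true}},
		},
	}
	failed, msg := containerFailure(st)
	fmt.Println(failed) // prints: true
	fmt.Println(msg)
}
```

Checking statuses this way flags the node as failed even though the pod object still reports `phase: Running`, which is exactly what the fix below does.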

Message from the maintainers:

Impacted by this bug? Give it a šŸ‘. We prioritise the issues with the most šŸ‘.

alexec commented 4 years ago

Proposed fix, at operator.go#1073:

                log.Infof("Processing ready daemon pod: %v", pod.ObjectMeta.SelfLink)
            }

            for _, s := range append(pod.Status.InitContainerStatuses, pod.Status.ContainerStatuses...) {
                // A container that terminated with a non-zero exit code is a
                // failure, even while the pod-level phase is still Running.
                t := s.State.Terminated
                if t != nil && t.ExitCode > 0 {
                    newPhase, message = inferFailedReason(pod)
                }
            }
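One subtlety of the loop above, noted as a general Go caveat rather than a bug observed here: `append(a, b...)` reuses `a`'s backing array when it has spare capacity, writing `b`'s elements into memory shared with `a`. Slices decoded from the API server normally have `len == cap`, so the pattern is safe in practice, but the aliasing is easy to demonstrate in isolation:

```go
package main

import "fmt"

func main() {
	// a has spare capacity, so append reuses its backing array and
	// the result aliases a's elements.
	a := make([]int, 1, 4)
	a[0] = 10
	b := []int{20, 30}
	c := append(a, b...)

	c[0] = 99 // also visible through a, since the arrays alias
	fmt.Println(a[0], c) // prints: 99 [99 20 30]
}
```

With `len == cap` (the usual case for decoded status slices), `append` allocates a fresh array and no aliasing occurs, which is why the read-only loop in the fix is unaffected.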
alexec commented 4 years ago

This has not been seen in the wild; it may be a rare, K3s-only issue. The fix might also introduce new bugs.