argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Random failures with message "Error: No such container:path: ... This does not look like a tar archive tar" #6352

Closed: Bobgy closed this issue 3 years ago

Bobgy commented 3 years ago

Downstream KFP issue: https://github.com/kubeflow/pipelines/issues/5943

Summary

What happened/what you expected to happen?

When I run many workflows in a cluster, some workflows (5~10%) fail at random steps with errors like the following message:

This step is in Error state with this message:

Error (exit code 1): Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors

The error can happen at any step, at random.

I'd expect Argo to identify the root cause of this and return a meaningful error message.

Diagnostics

👀 Yes! We need all of your diagnostics, please make sure you add it all, otherwise we'll go around in circles asking you for it:

What Kubernetes provider are you using? GKE (Kubernetes 1.18, Container-Optimized OS nodes with docker)

What version of Argo Workflows are you running? v3.1.1

What executor are you running? Docker

Did this work in a previous version? I.e. is it a regression? I didn't see this in v2.12.x or v3.0.x.

Investigations

After reading through the detailed logs, here's my understanding of the problem:

  1. Argo creates Pod A.
  2. There are too many Pods on the same node, so the kubelet decides to kill Pod A to leave enough resources for other Pods (see "Killing unwanted pod" in the logs below).
  3. Therefore, all of the containers receive a TERM signal at the same time. (This is probably wrong.)
  4. The wait container tries to get output artifacts/parameters before the main container finishes -- see the detailed logs below (the wait container finished at 2021-07-16T08:55:36Z, earlier than the main container, which finished at 2021-07-16T08:55:38Z).
  5. The wait container failed to get outputs from the main container: sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
  6. We got the final error message: Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data tar: This does not look like a tar archive tar

ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

Suggestions

I don't understand all the wait container details, but as for the final error message, I think we can improve it by fixing the sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz command.

Note that this script's exit code is always that of the second command, gzip > /tmp/.... It doesn't matter whether the first command failed or not; this is how shells treat pipelines by default (without pipefail). For example, false | gzip > /dev/null; echo $? prints 0.

One workaround is to pipe the command outputs in Go (https://stackoverflow.com/a/10781582/8745218) and check the exit codes of both commands properly, so that we return the error message of the docker cp command when it fails. I think the file /tmp/outputs/Output/data didn't exist at the time this command was run.
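Here's a minimal sketch of that idea (not the actual executor code; the container ID and paths are copied from the logs above for illustration). It drives docker cp from Go and gzips its stdout in-process, so the exit status of docker cp is checked directly instead of being masked by the shell pipe:

package main

import (
    "compress/gzip"
    "fmt"
    "os"
    "os/exec"
)

// copyAndGzip streams `docker cp <id>:<path> -` (a tar stream on stdout)
// through an in-process gzip writer into destTgz. Because gzip runs
// in-process, there is only one external command whose exit status
// needs checking.
func copyAndGzip(containerID, srcPath, destTgz string) error {
    out, err := os.Create(destTgz)
    if err != nil {
        return err
    }
    defer out.Close()

    gz := gzip.NewWriter(out)

    cmd := exec.Command("docker", "cp", "-a", containerID+":"+srcPath, "-")
    cmd.Stdout = gz
    cmd.Stderr = os.Stderr

    // cmd.Run reports the real exit status of `docker cp`, so a
    // "No such container:path" failure surfaces as-is instead of the
    // misleading "does not look like a tar archive" from the next stage.
    if err := cmd.Run(); err != nil {
        return fmt.Errorf("docker cp %s:%s failed: %w", containerID, srcPath, err)
    }
    // Flush the gzip trailer; only then is the archive complete.
    return gz.Close()
}

func main() {
    err := copyAndGzip(
        "0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb",
        "/tmp/outputs/Output/data",
        "/tmp/argo/outputs/artifacts/training-op-Output.tgz")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}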


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Bobgy commented 3 years ago

Detailed logs

All container statuses:

  initContainerStatuses:
    - name: kfp-launcher
      state:
        terminated:
          exitCode: 0
          reason: Completed
          startedAt: '2021-07-16T08:55:13Z'
          finishedAt: '2021-07-16T08:55:14Z'
          containerID: >-
            docker://3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41
      lastState: {}
      ready: true
      restartCount: 0
      image: 'sha256:d9d584d65d567192276524136625d6d7f6c25e3cc6672e2e4e931ce9347360be'
      imageID: >-
        docker-pullable://gcr.io/gongyuan-dev/v2-sample-test/kfp-launcher@sha256:13068df2403e854d2d44aa2c66ed318b99d89ec35abfdef3d1b47a102b07f6d1
      containerID: >-
        docker://3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41
  containerStatuses:
    - name: main
      state:
        terminated:
          exitCode: 0
          reason: Completed
          startedAt: '2021-07-16T08:55:35Z'
          finishedAt: '2021-07-16T08:55:38Z'
          containerID: >-
            docker://0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb
      lastState: {}
      ready: false
      restartCount: 0
      image: 'python:3.7'
      imageID: >-
        docker-pullable://python@sha256:c9b155d64106b7c939cb82396251401418dbdd6ccf51a90c9ea4ede13162fa35
      containerID: >-
        docker://0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb
      started: false
    - name: wait
      state:
        terminated:
          exitCode: 1
          reason: Error
          message: >-
            path /tmp/outputs/Output/data does not exist in archive
            /tmp/argo/outputs/artifacts/training-op-Output.tgz
          startedAt: '2021-07-16T08:55:30Z'
          finishedAt: '2021-07-16T08:55:36Z'
          containerID: >-
            docker://cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17
      lastState: {}
      ready: false
      restartCount: 0
      image: 'gcr.io/ml-pipeline/argoexec:v3.1.1-patch-license-compliance'
      imageID: >-
        docker-pullable://gcr.io/ml-pipeline/argoexec@sha256:a27132f35207614ac684dc69e424ccc461ec534c099c28f7ff715b5d035144b7
      containerID: >-
        docker://cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17
      started: false

Kubelet logs related to this pod:

I0716 08:54:59.189645 1432 kubelet.go:1939] SyncLoop (ADD, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" I0716 08:54:59.248867 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.248930 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.249018 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.249097 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.249297 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350010 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350088 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350127 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350148 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350256 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350586 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod 
"pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.350798 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.363995 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.364539 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.365499 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93") I0716 08:54:59.646342 1432 kuberuntime_manager.go:434] No sandbox for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" can be found. Need to start a new one I0716 08:55:23.167460 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerStarted", Data:"ca400df6509ccb22cb04437ca486590c6b8e5833b4d9955fa6629814abde6ce7"} E0716 08:55:26.484429 1432 kuberuntime_manager.go:975] getPodContainerStatuses for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" failed: rpc error: code = Unknown desc = Error: No such container: cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 E0716 08:55:32.496413 1432 kuberuntime_manager.go:975] getPodContainerStatuses for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" failed: rpc error: code = Unknown desc = Error: No such container: 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb I0716 08:55:36.075101 1432 kubelet.go:1946] SyncLoop (UPDATE, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" I0716 08:55:53.324527 1432 kubelet_pods.go:1232] Killing unwanted pod "pipeline-with-resource-spec-lzjrz-1397752094" I0716 08:56:03.453733 1432 kubelet.go:1946] SyncLoop (UPDATE, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" I0716 08:56:09.741854 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41"} I0716 08:56:09.742010 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", 
Data:"0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb"} I0716 08:56:09.742077 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17"} I0716 08:56:23.064280 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"ca400df6509ccb22cb04437ca486590c6b8e5833b4d9955fa6629814abde6ce7"}

Logs for wait container:

time="2021-07-16T08:55:30.534Z" level=info msg="Starting Workflow Executor" executorType=docker version=untagged time="2021-07-16T08:55:30.538Z" level=info msg="Creating a docker executor" time="2021-07-16T08:55:30.539Z" level=info msg="Executor initialized" includeScriptOutput=false namespace=kubeflow podName=pipeline-with-resource-spec-lzjrz-1397752094 template="{\"name\":\"training-op\",\"inputs\":{\"parameters\":[{\"name\":\"n\",\"value\":\"11234567\"},{\"name\":\"pipeline-name\",\"value\":\"pipeline/pipeline-with-resource-spec\"},{\"name\":\"pipeline-output-directory\",\"value\":\"gs://gongyuan-dev/v2-sample-test/data/samples_config-loop-item\"}]},\"outputs\":{\"artifacts\":[{\"name\":\"training-op-Output\",\"path\":\"/tmp/outputs/Output/data\"}]},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/arguments.parameters\":\"{\\"n\\": \\"11234567\\"}\",\"pipelines.kubeflow.org/component_ref\":\"{}\",\"pipelines.kubeflow.org/max_cache_staleness\":\"P0D\",\"pipelines.kubeflow.org/v2_component\":\"true\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\",\"pipelines.kubeflow.org/enable_caching\":\"false\",\"pipelines.kubeflow.org/kfp_sdk_version\":\"1.6.4\",\"pipelines.kubeflow.org/pipeline-sdk-type\":\"kfp\",\"pipelines.kubeflow.org/v2_component\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"python:3.7\",\"command\":[\"/kfp-launcher/launch\",\"--mlmd_server_address\",\"$(METADATA_GRPC_SERVICE_HOST)\",\"--mlmd_server_port\",\"$(METADATA_GRPC_SERVICE_PORT)\",\"--runtime_info_json\",\"$(KFP_V2_RUNTIME_INFO)\",\"--container_image\",\"$(KFP_V2_IMAGE)\",\"--task_name\",\"training-op\",\"--pipeline_name\",\"pipeline/pipeline-with-resource-spec\",\"--run_id\",\"$(KFP_RUN_ID)\",\"--run_resource\",\"workflows.argoproj.io/$(WORKFLOW_ID)\",\"--namespace\",\"$(KFP_NAMESPACE)\",\"--pod_name\",\"$(KFP_POD_NAME)\",\"--pod_uid\",\"$(KFP_POD_UID)\",\"--pipeline_root\",\"gs://gongyuan-dev/v2-sample-test/data/samples_config-loop-item\",\"--enable_caching\",\"$(ENABLE_CACHING)\",\"--\",\"n=11234567\",\"--\"],\"args\":[\"sh\",\"-ec\",\"program_path=$(mktemp)\nprintf \\"%s\\" \\"$0\\" \u003e \\"$program_path\\"\npython3 -u \\"$program_path\\" \\"$@\\"\n\",\"def training_op(n):\n # quickly allocate a lot of memory to verify memory is enough\n a = [i for i in range(n)]\n return len(a)\n\ndef _serialize_int(int_value: int) -\u003e str:\n if isinstance(int_value, str):\n return int_value\n if not isinstance(int_value, int):\n raise TypeError('Value \\"{}\\" has type \\"{}\\" instead of int.'.format(str(int_value), str(type(int_value))))\n return str(int_value)\n\nimport argparse\n_parser = argparse.ArgumentParser(prog='Training op', description='')\n_parser.add_argument(\\"--n\\", dest=\\"n\\", type=int, required=True, default=argparse.SUPPRESS)\n_parser.add_argument(\\"----output-paths\\", dest=\\"_output_paths\\", type=str, nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files = _parsed_args.pop(\\"_output_paths\\", [])\n\n_outputs = training_op(**_parsed_args)\n\n_outputs = [_outputs]\n\n_output_serializers = [\n _serialize_int,\n\n]\n\nimport os\nfor idx, output_file in enumerate(_output_files):\n try:\n os.makedirs(os.path.dirname(output_file))\n except OSError:\n pass\n with open(output_file, 'w') as f:\n 
f.write(_output_serializersidx)\n\",\"--n\",\"{{$.inputs.parameters['n']}}\",\"----output-paths\",\"{{$.outputs.parameters['Output'].output_file}}\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}},{\"name\":\"KFP_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.namespace\"}}},{\"name\":\"WORKFLOW_ID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['workflows.argoproj.io/workflow']\"}}},{\"name\":\"KFP_RUN_ID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['pipeline/runid']\"}}},{\"name\":\"ENABLE_CACHING\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['pipelines.kubeflow.org/enable_caching']\"}}},{\"name\":\"KFP_V2_IMAGE\",\"value\":\"python:3.7\"},{\"name\":\"KFP_V2_RUNTIME_INFO\",\"value\":\"{\\"inputParameters\\": {\\"n\\": {\\"type\\": \\"INT\\"}}, \\"inputArtifacts\\": {}, \\"outputParameters\\": {\\"Output\\": {\\"type\\": \\"INT\\", \\"path\\": \\"/tmp/outputs/Output/data\\"}}, \\"outputArtifacts\\": {}}\"}],\"resources\":{\"limits\":{\"cpu\":\"1\",\"memory\":\"500Mi\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\"}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/gongyuan-dev/v2-sample-test/kfp-launcher@sha256:13068df2403e854d2d44aa2c66ed318b99d89ec35abfdef3d1b47a102b07f6d1\",\"command\":[\"/bin/mount_launcher.sh\"],\"resources\":{},\"mirrorVolumeMounts\":true}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094\"}}}" version="&Version{Version:untagged,BuildDate:2021-07-13T12:00:17Z,GitCommit:e72989db843bf092811274ed9c84abad18aaccc5,GitTag:untagged,GitTreeState:clean,GoVersion:go1.15.7,Compiler:gc,Platform:linux/amd64,}" time="2021-07-16T08:55:30.539Z" level=info msg="Starting annotations monitor" time="2021-07-16T08:55:30.540Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:30.541Z" level=info msg="Starting deadline monitor" time="2021-07-16T08:55:30.683Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Created {0 63762022523 }}]" time="2021-07-16T08:55:30.683Z" level=info msg="mapped container name \"wait\" to container ID \"cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17\" (created at 2021-07-16 08:55:23 +0000 UTC, status Created)" time="2021-07-16T08:55:30.683Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:31.683Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label 
\"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:31.738Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:31.738Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:32.738Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:32.798Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:32.799Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:32.799Z" level=info msg="mapped container name \"main\" to container ID \"0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb\" (created at 2021-07-16 08:55:31 +0000 UTC, status Created)" time="2021-07-16T08:55:33.540Z" level=info msg="docker wait 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb" time="2021-07-16T08:55:33.610Z" level=info msg="Main container completed" time="2021-07-16T08:55:33.610Z" level=info msg="No Script output reference in workflow. 
Capturing script output ignored" time="2021-07-16T08:55:33.610Z" level=info msg="Saving logs" time="2021-07-16T08:55:33.610Z" level=info msg="[docker logs 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb]" time="2021-07-16T08:55:33.799Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:33.850Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:33.850Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:34.851Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:34.899Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:34.899Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:35.783Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094/main.log" time="2021-07-16T08:55:35.783Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials" time="2021-07-16T08:55:35.783Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094/main.log)" time="2021-07-16T08:55:35.899Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:35.905Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log time="2021-07-16T08:55:35.905Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log" time="2021-07-16T08:55:35.905Z" level=info msg="No output parameters" time="2021-07-16T08:55:35.905Z" level=info msg="Saving output artifacts" time="2021-07-16T08:55:35.905Z" level=info msg="Staging artifact: training-op-Output" time="2021-07-16T08:55:35.905Z" level=info msg="Copying /tmp/outputs/Output/data from container base image layer to /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.905Z" level=info msg="Archiving main:/tmp/outputs/Output/data to 
/tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.906Z" level=info msg="sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.958Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Up {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:35.958Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:36.028Z" level=warning msg="path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:36.028Z" level=error msg="executor error: path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/training-op-Output.tgz\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.Errorf\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:88\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:402\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:290\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:276\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:62\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374" time="2021-07-16T08:55:36.028Z" level=info msg="Annotating pod with output" time="2021-07-16T08:55:36.068Z" level=info msg="Patch pods 200" time="2021-07-16T08:55:36.080Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:36.237Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} 
main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Up {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:36.237Z" level=info msg="Killing sidecars [\"kfp-launcher\"]" time="2021-07-16T08:55:36.252Z" level=info msg="Get pods 200" time="2021-07-16T08:55:36.258Z" level=info msg="zero container IDs, assuming all containers have exited successfully" time="2021-07-16T08:55:36.258Z" level=info msg="Alloc=10431 TotalAlloc=16471 Sys=74065 NumGC=5 Goroutines=12" time="2021-07-16T08:55:36.369Z" level=fatal msg="path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/training-op-Output.tgz\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.Errorf\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:88\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:402\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:290\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:276\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:62\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"

Bobgy commented 3 years ago

FYI: because Kubeflow Pipelines wants to move to the emissary executor, I won't invest further in the docker executor for now. Feel free to close this, or leave it open to see if others hit the same problems.

Bobgy commented 3 years ago

I'm resuming work on this, because it was a little rushed for KFP to default to emissary right now. Our plan is to keep supporting the docker executor for one more version and let people test out the emissary executor on demand.

Bobgy commented 3 years ago

I can now reliably reproduce this problem on GKE by running a lot of workflows in parallel; some will fail with this error.

I added some extra logging to the Argo executor, deployed it to my cluster, and got the following logs on a failed task:

time="2021-08-04T09:30:58.563Z" level=info msg="docker wait 5ac4b5f49f81e3468d593ee4aa5a987da8139d6a8af61613f0d5579055684716"

time="2021-08-04T09:30:58.759Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=metrics-visualization-pipeline-b5f6k-2986647169"

time="2021-08-04T09:30:58.760Z" level=info msg="listed containers" containers="map[kfp-launcher:{efb685efc742193570355cdb14eeeae71710d489b8a61bea4d07ffd41782b3d6 Exited {0 63763666237 }} main:{5ac4b5f49f81e3468d593ee4aa5a987da8139d6a8af61613f0d5579055684716 Created {0 63763666256 }} wait:{4bd0f867061e5ac69f35fc5e5e1d3c21a7c369d6ee112b3b862b4cbf1db79c1e Up {0 63763666253 }}]"

Bobgy commented 3 years ago

It seems that when docker wait was run, the container was in the "Created" state. docker wait returned almost right away, but the container was still in the "Created" state when it returned.

I tried to verify this behavior locally on my laptop by running docker wait in a loop, waiting for the container to be created, and then creating the container alongside it:

# In one terminal
$ while true; do docker wait tf && echo "done" && break; done
# In another terminal
$ docker run --rm --name tf -d tensorflow/tensorflow sh -c 'sleep 3'

I can always confirm that docker wait properly waits until the container finishes.

Therefore, my guess is that the unexpected behavior is a GKE bug. Reports of whether this flakiness can be observed outside GKE are welcome.

alexec commented 3 years ago

This has been reverted because it appears to have been failing the master build.

alexec commented 3 years ago

=== RUN   TestInvalidCommandSuite/TestInvalidCommand
Submitting workflow  cannot-start-
Waiting 1m0s for workflow metadata.name=cannot-start-9wzcj
 ? cannot-start-9wzcj Workflow 0s      

 ● cannot-start-9wzcj   Workflow 0s      
 └ ◷ cannot-start-9wzcj Pod      0s      

 ● cannot-start-9wzcj   Workflow 0s      
 └ ◷ cannot-start-9wzcj Pod      0s      ContainerCreating

 ● cannot-start-9wzcj   Workflow 0s      
 └ ● cannot-start-9wzcj Pod      0s      

    invalid_command_test.go:25: timeout after 1m0s waiting for condition
Checking expectation cannot-start-9wzcj
cannot-start-9wzcj : Running 
    invalid_command_test.go:28: 
            Error Trace:    invalid_command_test.go:28
                                        then.go:65
                                        then.go:40
                                        invalid_command_test.go:27
            Error:          "" does not contain "executable file not found"
            Test:           TestInvalidCommandSuite/TestInvalidCommand
=== FAIL: InvalidCommandSuite/TestInvalidCommand

Bobgy commented 3 years ago

@alexec thanks for taking care of this. Can you help me understand how to run the e2e tests locally?

alexec commented 3 years ago

they're just normal Golang tests, run them in your IDE

Bobgy commented 3 years ago

Thanks, let me have a try!

Bobgy commented 3 years ago

@alexec I tried running all the unit tests:

go test ./...

They all pass, but that didn't run the docker executor e2e tests.

/cc @terrytangyuan @sarabala1979

alexec commented 3 years ago

Run

make test-executor E2E_EXECUTOR=docker

Bobgy commented 3 years ago

Thank you! Let me verify that

Bobgy commented 3 years ago

I successfully reproduced the failing case. It happens because when the container has an invalid command, the container fails to run; however, its state is also stuck at "Created". We need a way to distinguish that case from this issue's case.

unexpected: container \"main\" still has state \"Created\" after docker wait

Bobgy commented 3 years ago

Here's the status we got from docker ps:

docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=cannot-start-zbbxv

Created|main|00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda|2021-08-07 12:09:27 +0000 UTC
Up About a minute|wait|6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc|2021-08-07 12:09:27 +0000 UTC
Up About a minute|POD|391668eda970115a07463f77e959642945ea5e140e9a8dd16364db7266ddb220|2021-08-07 12:09:26 +0000 UTC

Bobgy commented 3 years ago

Container status:

status:
  containerStatuses:
  - containerID: docker://00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda
    image: argoproj/argosay:v2
    imageID: docker-pullable://argoproj/argosay@sha256:f0b51e5d3a394af492de5ae50af93f11ba0862480d0eb2cbb48286fe58745303
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda
        exitCode: 127
        finishedAt: "2021-08-07T12:09:27Z"
        message: 'OCI runtime create failed: container_linux.go:367: starting container
          process caused: exec: "invalid": executable file not found in $PATH: unknown'
        reason: ContainerCannotRun
        startedAt: "2021-08-07T12:09:27Z"
  - containerID: docker://6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc
    image: argoproj/argoexec:latest
    imageID: docker://sha256:d4f4644d010fb5e3bd949b69057b25b29f11784ab31d226978906ffd819604e4
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc
        exitCode: 137
        finishedAt: "2021-08-07T12:14:27Z"
        message: Step terminated
        reason: Error
        startedAt: "2021-08-07T12:09:27Z"

Bobgy commented 3 years ago

when I inspect the container:

docker inspect <container-id>
...
   "Created": "2021-08-07T12:48:58.742009458Z",
        "Path": "invalid",
        "Args": [],
        "State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 127,
            "Error": "OCI runtime create failed: container_linux.go:367: starting container process caused: exec: \"invalid\": executable file not found in $PATH: unknown",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

Bobgy commented 3 years ago

This situation also seems like a bug in docker: the Status is "created", while "ExitCode" is non-zero and "Error" is non-empty.

Bobgy commented 3 years ago

Hmm, at least the behavior can be observed with the docker CLI too.

$ docker inspect $(docker run -d --entrypoint invalid argoproj/argosay:v2) | less
"State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 127,
            "Error": "OCI runtime create failed: container_linux.go:367: starting container process caused: exec: \"invalid\": executable file not found in $PATH: unknown",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        }


Bobgy commented 3 years ago

Unfortunately, we cannot include the exit code in the docker ps --format output: https://docs.docker.com/engine/reference/commandline/ps/#formatting

Bobgy commented 3 years ago

We could use docker inspect to get the exit code instead; I updated the implementation in #6508.
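For illustration, here's a minimal sketch of that idea (not the actual #6508 patch; the container ID is the one from the logs above). Unlike docker ps --format, docker inspect --format can template on State.ExitCode, so a container that never started ("created" with a non-zero exit code) is distinguishable from one that simply hasn't started yet:

package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// inspectState returns a container's State.Status and State.ExitCode
// via `docker inspect`, which reports the exit code even for containers
// stuck in the "created" state.
func inspectState(containerID string) (status string, exitCode int, err error) {
    out, err := exec.Command("docker", "inspect",
        "--format", "{{.State.Status}} {{.State.ExitCode}}", containerID).Output()
    if err != nil {
        return "", 0, err
    }
    fields := strings.Fields(string(out))
    if len(fields) != 2 {
        return "", 0, fmt.Errorf("unexpected docker inspect output: %q", out)
    }
    exitCode, err = strconv.Atoi(fields[1])
    return fields[0], exitCode, err
}

func main() {
    status, code, err := inspectState("00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda")
    if err != nil {
        panic(err)
    }
    // For the invalid-command case above, this prints: status=created exitCode=127
    fmt.Printf("status=%s exitCode=%d\n", status, code)
}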

Bobgy commented 3 years ago

For the record, here's how I tested the docker executor e2e tests locally:

  1. Install k3s with docker as the container runtime.
  2. Use the k3s kubeconfig:
    cp /etc/rancher/k3s/k3s.yaml $HOME/.kubeconfig
    export KUBECONFIG=$HOME/.kubeconfig
  3. Run the workflow controller:
    make start PROFILE=minimal E2E_EXECUTOR=docker AUTH_MODE=client STATIC_FILES=false LOG_LEVEL=info UI=false
  4. Build the argoexec image:
    make argoexec-image
  5. Run the e2e tests:
    make test-executor E2E_TIMEOUT=1m STATIC_FILES=false