Bobgy closed this issue 3 years ago.
All container statuses:
initContainerStatuses:
- name: kfp-launcher
  state:
    terminated:
      exitCode: 0
      reason: Completed
      startedAt: '2021-07-16T08:55:13Z'
      finishedAt: '2021-07-16T08:55:14Z'
      containerID: docker://3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41
  lastState: {}
  ready: true
  restartCount: 0
  image: 'sha256:d9d584d65d567192276524136625d6d7f6c25e3cc6672e2e4e931ce9347360be'
  imageID: docker-pullable://gcr.io/gongyuan-dev/v2-sample-test/kfp-launcher@sha256:13068df2403e854d2d44aa2c66ed318b99d89ec35abfdef3d1b47a102b07f6d1
  containerID: docker://3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41
containerStatuses:
- name: main
  state:
    terminated:
      exitCode: 0
      reason: Completed
      startedAt: '2021-07-16T08:55:35Z'
      finishedAt: '2021-07-16T08:55:38Z'
      containerID: docker://0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb
  lastState: {}
  ready: false
  restartCount: 0
  image: 'python:3.7'
  imageID: docker-pullable://python@sha256:c9b155d64106b7c939cb82396251401418dbdd6ccf51a90c9ea4ede13162fa35
  containerID: docker://0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb
  started: false
- name: wait
  state:
    terminated:
      exitCode: 1
      reason: Error
      message: >-
        path /tmp/outputs/Output/data does not exist in archive
        /tmp/argo/outputs/artifacts/training-op-Output.tgz
      startedAt: '2021-07-16T08:55:30Z'
      finishedAt: '2021-07-16T08:55:36Z'
      containerID: docker://cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17
  lastState: {}
  ready: false
  restartCount: 0
  image: 'gcr.io/ml-pipeline/argoexec:v3.1.1-patch-license-compliance'
  imageID: docker-pullable://gcr.io/ml-pipeline/argoexec@sha256:a27132f35207614ac684dc69e424ccc461ec534c099c28f7ff715b5d035144b7
  containerID: docker://cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17
  started: false
Kubelet logs related to this pod:
I0716 08:54:59.189645 1432 kubelet.go:1939] SyncLoop (ADD, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)"
I0716 08:54:59.248867 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.248930 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.249018 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.249097 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.249297 1432 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350010 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350088 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350127 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350148 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350256 1432 reconciler.go:269] operationExecutor.MountVolume started for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350586 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "docker-sock" (UniqueName: "kubernetes.io/host-path/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-docker-sock") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.350798 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "kfp-launcher" (UniqueName: "kubernetes.io/empty-dir/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-kfp-launcher") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.363995 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "podmetadata" (UniqueName: "kubernetes.io/downward-api/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-podmetadata") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.364539 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "mlpipeline-minio-artifact" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-mlpipeline-minio-artifact") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.365499 1432 operation_generator.go:658] MountVolume.SetUp succeeded for volume "pipeline-runner-token-qrtlb" (UniqueName: "kubernetes.io/secret/dc42a4ff-a48c-4c89-ad2d-7a94032d1e93-pipeline-runner-token-qrtlb") pod "pipeline-with-resource-spec-lzjrz-1397752094" (UID: "dc42a4ff-a48c-4c89-ad2d-7a94032d1e93")
I0716 08:54:59.646342 1432 kuberuntime_manager.go:434] No sandbox for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" can be found. Need to start a new one
I0716 08:55:23.167460 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerStarted", Data:"ca400df6509ccb22cb04437ca486590c6b8e5833b4d9955fa6629814abde6ce7"}
E0716 08:55:26.484429 1432 kuberuntime_manager.go:975] getPodContainerStatuses for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" failed: rpc error: code = Unknown desc = Error: No such container: cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17
E0716 08:55:32.496413 1432 kuberuntime_manager.go:975] getPodContainerStatuses for pod "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)" failed: rpc error: code = Unknown desc = Error: No such container: 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb
I0716 08:55:36.075101 1432 kubelet.go:1946] SyncLoop (UPDATE, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)"
I0716 08:55:53.324527 1432 kubelet_pods.go:1232] Killing unwanted pod "pipeline-with-resource-spec-lzjrz-1397752094"
I0716 08:56:03.453733 1432 kubelet.go:1946] SyncLoop (UPDATE, "api"): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)"
I0716 08:56:09.741854 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41"}
I0716 08:56:09.742010 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb"}
I0716 08:56:09.742077 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17"}
I0716 08:56:23.064280 1432 kubelet.go:1984] SyncLoop (PLEG): "pipeline-with-resource-spec-lzjrz-1397752094_kubeflow(dc42a4ff-a48c-4c89-ad2d-7a94032d1e93)", event: &pleg.PodLifecycleEvent{ID:"dc42a4ff-a48c-4c89-ad2d-7a94032d1e93", Type:"ContainerDied", Data:"ca400df6509ccb22cb04437ca486590c6b8e5833b4d9955fa6629814abde6ce7"}
Logs for wait container:
time="2021-07-16T08:55:30.534Z" level=info msg="Starting Workflow Executor" executorType=docker version=untagged time="2021-07-16T08:55:30.538Z" level=info msg="Creating a docker executor" time="2021-07-16T08:55:30.539Z" level=info msg="Executor initialized" includeScriptOutput=false namespace=kubeflow podName=pipeline-with-resource-spec-lzjrz-1397752094 template="{\"name\":\"training-op\",\"inputs\":{\"parameters\":[{\"name\":\"n\",\"value\":\"11234567\"},{\"name\":\"pipeline-name\",\"value\":\"pipeline/pipeline-with-resource-spec\"},{\"name\":\"pipeline-output-directory\",\"value\":\"gs://gongyuan-dev/v2-sample-test/data/samples_config-loop-item\"}]},\"outputs\":{\"artifacts\":[{\"name\":\"training-op-Output\",\"path\":\"/tmp/outputs/Output/data\"}]},\"metadata\":{\"annotations\":{\"pipelines.kubeflow.org/arguments.parameters\":\"{\\"n\\": \\"11234567\\"}\",\"pipelines.kubeflow.org/component_ref\":\"{}\",\"pipelines.kubeflow.org/max_cache_staleness\":\"P0D\",\"pipelines.kubeflow.org/v2_component\":\"true\",\"sidecar.istio.io/inject\":\"false\"},\"labels\":{\"pipelines.kubeflow.org/cache_enabled\":\"true\",\"pipelines.kubeflow.org/enable_caching\":\"false\",\"pipelines.kubeflow.org/kfp_sdk_version\":\"1.6.4\",\"pipelines.kubeflow.org/pipeline-sdk-type\":\"kfp\",\"pipelines.kubeflow.org/v2_component\":\"true\"}},\"container\":{\"name\":\"\",\"image\":\"python:3.7\",\"command\":[\"/kfp-launcher/launch\",\"--mlmd_server_address\",\"$(METADATA_GRPC_SERVICE_HOST)\",\"--mlmd_server_port\",\"$(METADATA_GRPC_SERVICE_PORT)\",\"--runtime_info_json\",\"$(KFP_V2_RUNTIME_INFO)\",\"--container_image\",\"$(KFP_V2_IMAGE)\",\"--task_name\",\"training-op\",\"--pipeline_name\",\"pipeline/pipeline-with-resource-spec\",\"--run_id\",\"$(KFP_RUN_ID)\",\"--run_resource\",\"workflows.argoproj.io/$(WORKFLOW_ID)\",\"--namespace\",\"$(KFP_NAMESPACE)\",\"--pod_name\",\"$(KFP_POD_NAME)\",\"--pod_uid\",\"$(KFP_POD_UID)\",\"--pipeline_root\",\"gs://gongyuan-dev/v2-sample-test/data/samples_config-
loop-item\",\"--enable_caching\",\"$(ENABLE_CACHING)\",\"--\",\"n=11234567\",\"--\"],\"args\":[\"sh\",\"-ec\",\"program_path=$(mktemp)\nprintf \\"%s\\" \\"$0\\" \u003e \\"$program_path\\"\npython3 -u \\"$program_path\\" \\"$@\\"\n\",\"def training_op(n):\n # quickly allocate a lot of memory to verify memory is enough\n a = [i for i in range(n)]\n return len(a)\n\ndef _serialize_int(int_value: int) -\u003e str:\n if isinstance(int_value, str):\n return int_value\n if not isinstance(int_value, int):\n raise TypeError('Value \\"{}\\" has type \\"{}\\" instead of int.'.format(str(int_value), str(type(int_value))))\n return str(int_value)\n\nimport argparse\n_parser = argparse.ArgumentParser(prog='Training op', description='')\n_parser.add_argument(\\"--n\\", dest=\\"n\\", type=int, required=True, default=argparse.SUPPRESS)\n_parser.add_argument(\\"----output-paths\\", dest=\\"_output_paths\\", type=str, nargs=1)\n_parsed_args = vars(_parser.parse_args())\n_output_files = _parsed_args.pop(\\"_output_paths\\", [])\n\n_outputs = training_op(**_parsed_args)\n\n_outputs = [_outputs]\n\n_output_serializers = [\n _serialize_int,\n\n]\n\nimport os\nfor idx, output_file in enumerate(_output_files):\n try:\n os.makedirs(os.path.dirname(output_file))\n except OSError:\n pass\n with open(output_file, 'w') as f:\n 
f.write(_output_serializersidx)\n\",\"--n\",\"{{$.inputs.parameters['n']}}\",\"----output-paths\",\"{{$.outputs.parameters['Output'].output_file}}\"],\"envFrom\":[{\"configMapRef\":{\"name\":\"metadata-grpc-configmap\",\"optional\":true}}],\"env\":[{\"name\":\"KFP_POD_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.name\"}}},{\"name\":\"KFP_POD_UID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.uid\"}}},{\"name\":\"KFP_NAMESPACE\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.namespace\"}}},{\"name\":\"WORKFLOW_ID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['workflows.argoproj.io/workflow']\"}}},{\"name\":\"KFP_RUN_ID\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['pipeline/runid']\"}}},{\"name\":\"ENABLE_CACHING\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['pipelines.kubeflow.org/enable_caching']\"}}},{\"name\":\"KFP_V2_IMAGE\",\"value\":\"python:3.7\"},{\"name\":\"KFP_V2_RUNTIME_INFO\",\"value\":\"{\\"inputParameters\\": {\\"n\\": {\\"type\\": \\"INT\\"}}, \\"inputArtifacts\\": {}, \\"outputParameters\\": {\\"Output\\": {\\"type\\": \\"INT\\", \\"path\\": \\"/tmp/outputs/Output/data\\"}}, \\"outputArtifacts\\": 
{}}\"}],\"resources\":{\"limits\":{\"cpu\":\"1\",\"memory\":\"500Mi\"}},\"volumeMounts\":[{\"name\":\"kfp-launcher\",\"mountPath\":\"/kfp-launcher\"}]},\"volumes\":[{\"name\":\"kfp-launcher\"}],\"initContainers\":[{\"name\":\"kfp-launcher\",\"image\":\"gcr.io/gongyuan-dev/v2-sample-test/kfp-launcher@sha256:13068df2403e854d2d44aa2c66ed318b99d89ec35abfdef3d1b47a102b07f6d1\",\"command\":[\"/bin/mount_launcher.sh\"],\"resources\":{},\"mirrorVolumeMounts\":true}],\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094\"}}}" version="&Version{Version:untagged,BuildDate:2021-07-13T12:00:17Z,GitCommit:e72989db843bf092811274ed9c84abad18aaccc5,GitTag:untagged,GitTreeState:clean,GoVersion:go1.15.7,Compiler:gc,Platform:linux/amd64,}" time="2021-07-16T08:55:30.539Z" level=info msg="Starting annotations monitor" time="2021-07-16T08:55:30.540Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:30.541Z" level=info msg="Starting deadline monitor" time="2021-07-16T08:55:30.683Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509
}} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Created {0 63762022523 }}]" time="2021-07-16T08:55:30.683Z" level=info msg="mapped container name \"wait\" to container ID \"cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17\" (created at 2021-07-16 08:55:23 +0000 UTC, status Created)" time="2021-07-16T08:55:30.683Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:31.683Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:31.738Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:31.738Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:32.738Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:32.798Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:32.799Z" level=info msg="ignoring container \"kfp-launcher\" created at 
2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:32.799Z" level=info msg="mapped container name \"main\" to container ID \"0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb\" (created at 2021-07-16 08:55:31 +0000 UTC, status Created)" time="2021-07-16T08:55:33.540Z" level=info msg="docker wait 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb" time="2021-07-16T08:55:33.610Z" level=info msg="Main container completed" time="2021-07-16T08:55:33.610Z" level=info msg="No Script output reference in workflow. Capturing script output ignored" time="2021-07-16T08:55:33.610Z" level=info msg="Saving logs" time="2021-07-16T08:55:33.610Z" level=info msg="[docker logs 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb]" time="2021-07-16T08:55:33.799Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:33.850Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:33.850Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:34.851Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:34.899Z" level=info msg="listed containers" 
containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Created {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:34.899Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:35.783Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094/main.log" time="2021-07-16T08:55:35.783Z" level=info msg="Creating minio client minio-service.kubeflow:9000 using static credentials" time="2021-07-16T08:55:35.783Z" level=info msg="Saving from /tmp/argo/outputs/logs/main.log to s3 (endpoint: minio-service.kubeflow:9000, bucket: mlpipeline, key: artifacts/pipeline-with-resource-spec-lzjrz/2021/07/16/pipeline-with-resource-spec-lzjrz-1397752094/main.log)" time="2021-07-16T08:55:35.899Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:35.905Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log time="2021-07-16T08:55:35.905Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log" time="2021-07-16T08:55:35.905Z" level=info msg="No output parameters" time="2021-07-16T08:55:35.905Z" level=info msg="Saving output artifacts" time="2021-07-16T08:55:35.905Z" level=info msg="Staging artifact: training-op-Output" time="2021-07-16T08:55:35.905Z" level=info msg="Copying /tmp/outputs/Output/data from container base image layer to 
/tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.905Z" level=info msg="Archiving main:/tmp/outputs/Output/data to /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.906Z" level=info msg="sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:35.958Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Up {0 63762022531 }} wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:35.958Z" level=info msg="ignoring container \"kfp-launcher\" created at 2021-07-16 08:55:09 +0000 UTC, too long before process started" time="2021-07-16T08:55:36.028Z" level=warning msg="path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/training-op-Output.tgz" time="2021-07-16T08:55:36.028Z" level=error msg="executor error: path /tmp/outputs/Output/data does not exist in archive 
/tmp/argo/outputs/artifacts/training-op-Output.tgz\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.Errorf\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:88\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:402\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:290\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:276\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:62\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374" time="2021-07-16T08:55:36.028Z" level=info msg="Annotating pod with output" time="2021-07-16T08:55:36.068Z" level=info msg="Patch pods 200" time="2021-07-16T08:55:36.080Z" level=info msg="docker ps --all 
--no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=pipeline-with-resource-spec-lzjrz-1397752094" time="2021-07-16T08:55:36.237Z" level=info msg="listed containers" containers="map[kfp-launcher:{3aeac64b1711f803e56420dc95cc74ec92ec27120f55ddf60526afcc1e704e41 Exited {0 63762022509 }} main:{0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb Up {0 63762022531 DockerExecutor).CopyFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/docker/docker.go:88\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).stageArchiveFile\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:402\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).saveArtifact\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:290\ngithub.com/argoproj/argo-workflows/v3/workflow/executor.(WorkflowExecutor).SaveArtifacts\n\t/go/src/github.com/argoproj/argo-workflows/workflow/executor/executor.go:276\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:62\ngithub.com/argoproj/argo-workflows/v3/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/commands/wait.go:18\ngithub.com/spf13/cobra.(Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:846\ngithub.com/spf13/cobra.(Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.0.0/command.go:887\nmain.main\n\t/go/src/github.com/argoproj/argo-workflows/cmd/argoexec/main.go:15\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}} 
wait:{cfb2b6e1bfa746a65fc2b1473a3febc43e2f5694dd5816a80d2bdbabf881cd17 Up {0 63762022523 }}]" time="2021-07-16T08:55:36.237Z" level=info msg="Killing sidecars [\"kfp-launcher\"]" time="2021-07-16T08:55:36.252Z" level=info msg="Get pods 200" time="2021-07-16T08:55:36.258Z" level=info msg="zero container IDs, assuming all containers have exited successfully" time="2021-07-16T08:55:36.258Z" level=info msg="Alloc=10431 TotalAlloc=16471 Sys=74065 NumGC=5 Goroutines=12" time="2021-07-16T08:55:36.369Z" level=fatal msg="path /tmp/outputs/Output/data does not exist in archive /tmp/argo/outputs/artifacts/training-op-Output.tgz\ngithub.com/argoproj/argo-workflows/v3/errors.New\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:49\ngithub.com/argoproj/argo-workflows/v3/errors.Errorf\n\t/go/src/github.com/argoproj/argo-workflows/errors/errors.go:55\ngithub.com/argoproj/argo-workflows/v3/workflow/executor/docker.(
FYI, because Kubeflow Pipelines wants to move to the emissary executor, I won't invest further in the docker executor for now. Feel free to close this, or leave it open to see if others hit the same problem.
I'm resuming work on this, because defaulting KFP to emissary right now was a little rushed. Our plan is to keep supporting the docker executor for one more version and let people test out the emissary executor on demand.
I can now reliably reproduce this problem on GKE by running many workflows in parallel; some of them fail with this error.
I added some extra logging to the Argo executor and deployed it to my cluster, and got the following logs on a failed task:
time="2021-08-04T09:30:58.563Z" level=info msg="docker wait 5ac4b5f49f81e3468d593ee4aa5a987da8139d6a8af61613f0d5579055684716"
time="2021-08-04T09:30:58.759Z" level=info msg="docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=kubeflow --filter=label=io.kubernetes.pod.name=metrics-visualization-pipeline-b5f6k-2986647169"
time="2021-08-04T09:30:58.760Z" level=info msg="listed containers" containers="map[kfp-launcher:{efb685efc742193570355cdb14eeeae71710d489b8a61bea4d07ffd41782b3d6 Exited {0 63763666237
}} main:{5ac4b5f49f81e3468d593ee4aa5a987da8139d6a8af61613f0d5579055684716 Created {0 63763666256 }} wait:{4bd0f867061e5ac69f35fc5e5e1d3c21a7c369d6ee112b3b862b4cbf1db79c1e Up {0 63763666253 }}]"
It seems that when docker wait was run, the container was still in the "Created" state. docker wait returned almost immediately, yet the container remained in the "Created" state even after docker wait succeeded.
I tried to verify this behavior locally on my laptop by running docker wait in a loop before the container exists, then creating the container from another terminal:
# In one terminal
$ while true; do docker wait tf && echo "done" && break; done
# In another terminal
$ docker run --rm --name tf -d tensorflow/tensorflow sh -c 'sleep 3'
Locally, I can always confirm that docker wait properly waits until the container finishes.
Therefore, my guess is that the unexpected behavior could be a GKE bug. Reports on whether this flakiness can be observed outside of GKE are welcome.
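Given that docker wait can apparently return while the container is still in the "Created" state, one possible mitigation is to re-check the container state after docker wait returns and retry for a short grace period. Below is only a sketch in Python with a hypothetical helper name (wait_until_not_created) and an injected status callback standing in for docker inspect; it is not the executor's actual Go implementation:

```python
import time

def wait_until_not_created(inspect_status, timeout=10.0, interval=0.5, sleep=time.sleep):
    """Poll a container's state after `docker wait` returns, until it leaves
    the "created" state or the timeout expires.

    `inspect_status` is a callable returning the current state string (what
    `docker inspect --format '{{.State.Status}}'` would print). Returns the
    final state string, or raises TimeoutError if the container never leaves
    "created" -- the symptom observed in this issue.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = inspect_status()
        if status != "created":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('container still in "created" state after docker wait')
        sleep(interval)

# Example with a fake inspector that flips to "exited" on the third poll:
states = iter(["created", "created", "exited"])
assert wait_until_not_created(lambda: next(states), sleep=lambda _: None) == "exited"
```

The timeout matters because, as shown later in this thread, a container with an invalid command also stays in "Created" forever, so polling alone cannot be unbounded.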
This has been reverted because it appears to have been failing the master build.
=== RUN TestInvalidCommandSuite/TestInvalidCommand
Submitting workflow cannot-start-
Waiting 1m0s for workflow metadata.name=cannot-start-9wzcj
? cannot-start-9wzcj Workflow 0s
● cannot-start-9wzcj Workflow 0s
└ ◷ cannot-start-9wzcj Pod 0s
● cannot-start-9wzcj Workflow 0s
└ ◷ cannot-start-9wzcj Pod 0s ContainerCreating
● cannot-start-9wzcj Workflow 0s
└ ● cannot-start-9wzcj Pod 0s
invalid_command_test.go:25: timeout after 1m0s waiting for condition
Checking expectation cannot-start-9wzcj
cannot-start-9wzcj : Running
invalid_command_test.go:28:
Error Trace: invalid_command_test.go:28
then.go:65
then.go:40
invalid_command_test.go:27
Error: "" does not contain "executable file not found"
Test: TestInvalidCommandSuite/TestInvalidCommand
=== FAIL: InvalidCommandSuite/TestInvalidCommand
@alexec thanks for taking care of this, can you help me understand how to run e2e tests locally?
they're just normal Golang tests, run them in your IDE
Thanks, let me have a try!
@alexec I tried running all the unit tests:
go test ./...
They are all passing, but that didn't run the docker executor e2e test.
/cc @terrytangyuan @sarabala1979
Run
make test-executor E2E_EXECUTOR=docker
Thank you! Let me verify that
I successfully reproduced the failing case: when the container has an invalid command, it fails to run, but its state also gets stuck at "Created". We need a way to distinguish that case from this issue's case.
unexpected: container \"main\" still has state \"Created\" after docker wait
here's the status we got from docker ps
docker ps --all --no-trunc --format={{.Status}}|{{.Label \"io.kubernetes.container.name\"}}|{{.ID}}|{{.CreatedAt}} --filter=label=io.kubernetes.pod.namespace=argo --filter=label=io.kubernetes.pod.name=cannot-start-zbbxv
Created|main|00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda|2021-08-07 12:09:27 +0000 UTC\nUp About a minute|wait|6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc|2021-08-07 12:09:27 +0000 UTC\nUp About a minute|POD|391668eda970115a07463f77e959642945ea5e140e9a8dd16364db7266ddb220|2021-08-07 12:09:26 +0000 UTC\n
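For reference, the executor has to reconstruct per-container state from this pipe-delimited listing. A minimal sketch of such a parser (the `psEntry` type and `parsePS` helper are illustrative names, not the executor's actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// psEntry holds one line of the pipe-delimited `docker ps` output above:
// {{.Status}}|{{.Label "io.kubernetes.container.name"}}|{{.ID}}|{{.CreatedAt}}
type psEntry struct {
	Status    string
	Name      string
	ID        string
	CreatedAt string
}

// parsePS splits the newline-separated, pipe-delimited listing into entries.
// SplitN keeps the CreatedAt field intact even if it ever contained a pipe.
func parsePS(out string) ([]psEntry, error) {
	var entries []psEntry
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		parts := strings.SplitN(line, "|", 4)
		if len(parts) != 4 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		entries = append(entries, psEntry{parts[0], parts[1], parts[2], parts[3]})
	}
	return entries, nil
}

func main() {
	out := "Created|main|00709e8557ab|2021-08-07 12:09:27 +0000 UTC\nUp About a minute|wait|6b6b5c4f5531|2021-08-07 12:09:27 +0000 UTC\n"
	entries, _ := parsePS(out)
	for _, e := range entries {
		fmt.Printf("%s -> %s\n", e.Name, e.Status)
	}
	// main -> Created
	// wait -> Up About a minute
}
```

Note that on this path the "main" container reports only "Created", with no exit code, which is why the executor cannot tell a failed start from a not-yet-started container.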
container status
status:
containerStatuses:
- containerID: docker://00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda
image: argoproj/argosay:v2
imageID: docker-pullable://argoproj/argosay@sha256:f0b51e5d3a394af492de5ae50af93f11ba0862480d0eb2cbb48286fe58745303
lastState: {}
name: main
ready: false
restartCount: 0
state:
terminated:
containerID: docker://00709e8557ab41e4339732aaf34b7dda8c9ad55e161fb11beab02e7934cf0bda
exitCode: 127
finishedAt: "2021-08-07T12:09:27Z"
message: 'OCI runtime create failed: container_linux.go:367: starting container
process caused: exec: "invalid": executable file not found in $PATH: unknown'
reason: ContainerCannotRun
startedAt: "2021-08-07T12:09:27Z"
- containerID: docker://6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc
image: argoproj/argoexec:latest
imageID: docker://sha256:d4f4644d010fb5e3bd949b69057b25b29f11784ab31d226978906ffd819604e4
lastState: {}
name: wait
ready: false
restartCount: 0
state:
terminated:
containerID: docker://6b6b5c4f553174accda240f63cb3925a09485590ad8c1002a39fa8e728c0c9bc
exitCode: 137
finishedAt: "2021-08-07T12:14:27Z"
message: Step terminated
reason: Error
startedAt: "2021-08-07T12:09:27Z"
when I inspect the container:
docker inspect <container-id>
...
"Created": "2021-08-07T12:48:58.742009458Z",
"Path": "invalid",
"Args": [],
"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 127,
"Error": "OCI runtime create failed: container_linux.go:367: starting container process caused: exec: \"invalid\": executable file not found in $PATH: unknown",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
This situation also looks like a bug in docker: the Status is "created" while "ExitCode" is non-zero and "Error" is non-empty.
Hmm, at least the behavior can be observed on docker CLI too.
$ docker inspect $(docker run -d --entrypoint invalid argoproj/argosay:v2) | less
"State": {
"Status": "created",
"Running": false,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 0,
"ExitCode": 127,
"Error": "OCI runtime create failed: container_linux.go:367: starting container process caused: exec: \"invalid\": executable file not found in $PATH: unknown",
"StartedAt": "0001-01-01T00:00:00Z",
"FinishedAt": "0001-01-01T00:00:00Z"
}
Similarly.
Unfortunately, we cannot include the exit code in the docker ps format: https://docs.docker.com/engine/reference/commandline/ps/#formatting
We could use docker inspect to get the exit code instead; implementation updated in #6508.
For the record, here's how I ran the docker executor e2e tests locally:
cp /etc/rancher/k3s/k3s.yaml $HOME/.kubeconfig
export KUBECONFIG=$HOME/.kubeconfig
make start PROFILE=minimal E2E_EXECUTOR=docker AUTH_MODE=client STATIC_FILES=false LOG_LEVEL=info UI=false
make argoexec-image
make test-executor E2E_TIMEOUT=1m STATIC_FILES=false
Downstream KFP issue: https://github.com/kubeflow/pipelines/issues/5943
Summary
What happened/what you expected to happen?
When I run many workflows in a cluster, some (5~10%) fail at random steps with errors like the following message:
The error can happen at any step, at random.
I'd expect Argo to identify the root cause and return a meaningful error message.
Diagnostics
What Kubernetes provider are you using? GKE (I'm using Kubernetes 1.18 with Container-optimized OS with docker nodes)
What version of Argo Workflows are you running? v3.1.1
What executor are you running? Docker
Did this work in a previous version? I.e. is it a regression? I didn't see this in v2.12.x or v3.0.x.
Investigations
After reading through the detailed logs, here's my understanding of the problem:
Therefore, all the containers receive a TERM signal at the same time. (This is probably wrong.) The wait container then saves outputs with:
sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
and fails with errors like:
Error: No such container:path: 32b49d8ac659f4e77ec768bd22ca38cfa97abd2006a185a4cce5c7d4a4f418f5:/tmp/outputs/sum/data
tar: This does not look like a tar archive
ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
Suggestions
I don't understand the wait container's details, but for the final error message I think we can improve things by fixing the
sh -c docker cp -a 0494f392c91a41d7f4cf878f9ced4c56f3a030b4b64fb7b3f1f6828db982a9eb:/tmp/outputs/Output/data - | gzip > /tmp/argo/outputs/artifacts/training-op-Output.tgz
command. Note that this script's exit code is always that of the second command,
gzip > /tmp/.....
regardless of whether the first command failed. This is a well-known pitfall of shell pipes. One workaround is to pipe the command outputs in Go (https://stackoverflow.com/a/10781582/8745218) and check the exit codes of both commands properly, so that we return the error message of the
docker cp
command when it fails. I think the file /tmp/outputs/Output/data didn't exist at the time this command was run.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.