argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.96k stars 3.19k forks

Resource template task hangs forever with Istio sidecar #7273

Open johnbuluba opened 2 years ago

johnbuluba commented 2 years ago

Summary

What happened/what you expected to happen?

We have a Workflow that has resource templates and Istio sidecar injection enabled. After the main container creates the resources, the istio-proxy sidecar is not stopped. Because of this, the Pod stays in the Running state and the task hangs forever.

We expected the istio-proxy to be terminated after the task completes. This works as expected in other templates, e.g. the container template, where the wait container kills the sidecars.

What version of Argo Workflows are you running? 3.1.6

Diagnostics

This is an example Workflow that uses a resource template with Istio sidecar injection enabled. The following should be created in a namespace with Istio enabled (with label istio-injection=enabled).

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: k8s-jobs-
  name: resource-istio-sidecar
  namespace: istio-enabled-namespace
spec:
  entrypoint: pi-tmpl
  serviceAccountName: pipeline-runner
  templates:
  - name: pi-tmpl
    metadata:
      annotations:
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
        sidecar.istio.io/inject: 'true'
    resource:
      action: create
      successCondition: status.succeeded > 0
      failureCondition: status.failed > 3
      manifest: |
        apiVersion: batch/v1
        kind: Job
        metadata:
          generateName: pi-job-
        spec:
          template:
            metadata:
              name: pi
              annotations:
                sidecar.istio.io/inject: 'false'
            spec:
              containers:
              - name: pi
                image: perl
                command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
              restartPolicy: Never
          backoffLimit: 4
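
The successCondition and failureCondition above drive how long the resource step waits on the Job. As a rough illustration only, the sketch below hard-codes those two conditions and checks the failure condition first, which is the order the executor logs in this report show. The names jobStatus, checkConditions and waitForJob are invented for this example and are not Argo's API; the real executor parses the condition expressions and fetches the Job from the API server on each retry.

```go
package main

import (
	"errors"
	"fmt"
)

// jobStatus holds the two Job status counters the conditions refer to.
type jobStatus struct {
	Succeeded int
	Failed    int
}

// errRetry means neither condition matched yet, so the caller should poll again.
var errRetry = errors.New("neither success nor failure condition matched")

// errFailed means the failure condition matched.
var errFailed = errors.New("failure condition matched")

// checkConditions hard-codes the conditions from the manifest (an assumption
// for illustration): failureCondition "status.failed > 3" is checked first,
// then successCondition "status.succeeded > 0".
func checkConditions(s jobStatus) error {
	if s.Failed > 3 {
		return errFailed
	}
	if s.Succeeded > 0 {
		return nil
	}
	return errRetry
}

// waitForJob walks a sequence of observed statuses, standing in for repeated
// GETs of the Job, and stops at the first one where a condition matches.
// It returns the number of polls used.
func waitForJob(observed []jobStatus) (int, error) {
	for i, s := range observed {
		if err := checkConditions(s); !errors.Is(err, errRetry) {
			return i + 1, err
		}
	}
	return len(observed), errRetry
}

func main() {
	// Two polls with no result, then one succeeded pod.
	polls, err := waitForJob([]jobStatus{{0, 0}, {0, 0}, {1, 0}})
	fmt.Println(polls, err)
}
```

In this sketch the wait returns on the third poll, which mirrors the three "Get jobs 200" rounds in the executor logs. The hang reported here happens after that point, once the wait has already returned successfully.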

What Kubernetes provider are you using? EKS

What executor are you running? PNS

# Logs from the workflow controller:
time="2021-11-24T08:52:54.288Z" level=info msg="Update leases 200"
time="2021-11-24T08:52:58.137Z" level=info msg="Processing workflow" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:52:58.143Z" level=info msg="Get configmaps 404"
time="2021-11-24T08:52:58.143Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-11-24T08:52:58.143Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-11-24T08:52:58.143Z" level=info msg="Updated phase  -> Running" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:52:58.143Z" level=info msg="Pod node resource-istio-sidecar initialized Pending" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:52:58.148Z" level=info msg="Create events 201"
time="2021-11-24T08:52:58.291Z" level=info msg="Create pods 201"
time="2021-11-24T08:52:58.302Z" level=info msg="Created pod: resource-istio-sidecar (resource-istio-sidecar)" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:52:58.314Z" level=info msg="Update workflows 200"
time="2021-11-24T08:52:58.321Z" level=info msg="Workflow update successful" namespace=personal-user phase=Running resourceVersion=4699517 workflow=resource-istio-sidecar
time="2021-11-24T08:52:59.292Z" level=info msg="Get leases 200"
time="2021-11-24T08:52:59.299Z" level=info msg="Update leases 200"
time="2021-11-24T08:53:04.311Z" level=info msg="Get leases 200"
time="2021-11-24T08:53:04.316Z" level=info msg="Update leases 200"
time="2021-11-24T08:53:08.311Z" level=info msg="Processing workflow" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:53:08.311Z" level=info msg="Updating node resource-istio-sidecar status Pending -> Running" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:53:08.321Z" level=info msg="Update workflows 200"
time="2021-11-24T08:53:08.322Z" level=info msg="Workflow update successful" namespace=personal-user phase=Running resourceVersion=4699694 workflow=resource-istio-sidecar
time="2021-11-24T08:53:08.327Z" level=info msg="Create events 201"
time="2021-11-24T08:53:09.322Z" level=info msg="Get leases 200"
time="2021-11-24T08:53:09.326Z" level=info msg="Update leases 200"
time="2021-11-24T08:53:14.331Z" level=info msg="Get leases 200"
time="2021-11-24T08:53:14.336Z" level=info msg="Update leases 200"
time="2021-11-24T08:53:19.340Z" level=info msg="Get leases 200"
time="2021-11-24T08:53:19.348Z" level=info msg="Update leases 200"
time="2021-11-24T08:53:23.474Z" level=info msg="Processing workflow" namespace=personal-user workflow=resource-istio-sidecar
time="2021-11-24T08:53:23.479Z" level=info msg="cleaning up pod" action=terminateContainers key=personal-user/resource-istio-sidecar/terminateContainers
time="2021-11-24T08:53:23.480Z" level=info msg="https://10.100.0.1:443/api/v1/namespaces/personal-user/pods/resource-istio-sidecar/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+1&container=istio-proxy&stderr=true&stdout=true&tty=false"
time="2021-11-24T08:53:23.518Z" level=info msg="Create pods/exec 101"
time="2021-11-24T08:53:23.589Z" level=info msg="signaled container" container=istio-proxy error="command terminated with exit code 1" namespace=personal-user pod=resource-istio-sidecar stderr="<nil>" stdout="<nil>"
time="2021-11-24T08:53:23.589Z" level=warning msg="failed to clean-up pod" action=terminateContainers error="command terminated with exit code 1" key=personal-user/resource-istio-sidecar/terminateContainers
time="2021-11-24T08:53:23.589Z" level=warning msg="Non-transient error: command terminated with exit code 1"
time="2021-11-24T08:53:24.354Z" level=info msg="Get leases 200"
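
The pods/exec URL in the controller log above encodes what the controller tried to run in the istio-proxy container. A small sketch that decodes it with Go's standard net/url package follows; execRequest and parseExecURL are names made up for this example and are not part of Argo.

```go
package main

import (
	"fmt"
	"net/url"
)

// execRequest is the part of a pods/exec URL that matters here.
type execRequest struct {
	Container string
	Command   []string
}

// parseExecURL extracts the target container and the command arguments
// from a Kubernetes pods/exec URL such as the one the controller logged.
func parseExecURL(raw string) (execRequest, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return execRequest{}, err
	}
	q := u.Query()
	return execRequest{
		Container: q.Get("container"),
		Command:   q["command"], // repeated "command" parameters, in order
	}, nil
}

func main() {
	// URL copied from the controller log.
	const logged = "https://10.100.0.1:443/api/v1/namespaces/personal-user/pods/resource-istio-sidecar/exec?command=%2Fbin%2Fsh&command=-c&command=kill+-15+1&container=istio-proxy&stderr=true&stdout=true&tty=false"

	req, err := parseExecURL(logged)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %q\n", req.Container, req.Command)
}
```

Decoded, the request is /bin/sh -c "kill -15 1" in the istio-proxy container, i.e. the controller tried to send SIGTERM to PID 1 through a shell, and the log shows that command exiting with code 1.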

# The workflow's pods that are problematic:
time="2021-11-24T09:04:29.651Z" level=info msg="Starting Workflow Executor" executorType=pns version=v3.1.6-patch
time="2021-11-24T09:04:29.654Z" level=info msg="Creating PNS executor (namespace: personal-user, pod: resource-istio-sidecar, pid: 69)"
time="2021-11-24T09:04:29.654Z" level=info msg="Creating a K8sAPI executor"
time="2021-11-24T09:04:29.654Z" level=info msg="Executor initialized" includeScriptOutput=false namespace=personal-user podName=resource-istio-sidecar template="{\"name\":\"pi-tmpl\",\"inputs\":{},\"outputs\":{},\"metadata\":{\"annotations\":{\"proxy.istio.io/config\":\"{ \\\"holdApplicationUntilProxyStarts\\\": true }\",\"sidecar.istio.io/inject\":\"true\"}},\"resource\":{\"action\":\"create\",\"manifest\":\"apiVersion: batch/v1\\nkind: Job\\nmetadata:\\n  generateName: pi-job-\\nspec:\\n  template:\\n    metadata:\\n      name: pi\\n      annotations:\\n        sidecar.istio.io/inject: 'false'\\n    spec:\\n      containers:\\n      - name: pi\\n        image: perl\\n        command: [\\\"perl\\\",  \\\"-Mbignum=bpi\\\", \\\"-wle\\\", \\\"print bpi(2000)\\\"]\\n      restartPolicy: Never\\n  backoffLimit: 4\\n\",\"successCondition\":\"status.succeeded \\u003e 0\",\"failureCondition\":\"status.failed \\u003e 3\"},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"minio-service.kubeflow:9000\",\"bucket\":\"mlpipeline\",\"insecure\":true,\"accessKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"accesskey\"},\"secretKeySecret\":{\"name\":\"mlpipeline-minio-artifact\",\"key\":\"secretkey\"},\"key\":\"artifacts/resource-istio-sidecar/2021/11/24/resource-istio-sidecar\"}}}" version="&Version{Version:v3.1.6-patch,BuildDate:2021-08-18T12:50:41Z,GitCommit:9c47963b66061143735843db27977dbf9b4cbbf4,GitTag:v3.1.6-patch,GitTreeState:clean,GoVersion:go1.15.7,Compiler:gc,Platform:linux/amd64,}"
time="2021-11-24T09:04:29.654Z" level=info msg="Loading manifest to /tmp/manifest.yaml"
time="2021-11-24T09:04:29.654Z" level=info msg="kubectl create -f /tmp/manifest.yaml -o json"
time="2021-11-24T09:04:30.632Z" level=info msg="Resource: personal-user/job.batch/pi-job-7467z. SelfLink: apis/batch/v1/namespaces/personal-user/jobs/pi-job-7467z"
time="2021-11-24T09:04:30.632Z" level=info msg="Starting SIGUSR2 signal monitor"
time="2021-11-24T09:04:30.632Z" level=info msg="Waiting for conditions: status.succeeded>0"
time="2021-11-24T09:04:30.632Z" level=info msg="Failing for conditions: status.failed>3"
time="2021-11-24T09:04:30.641Z" level=info msg="Get jobs 200"
time="2021-11-24T09:04:30.641Z" level=info msg="failure condition '{status.failed gt [3]}' evaluated false"
time="2021-11-24T09:04:30.641Z" level=info msg="success condition '{status.succeeded gt [0]}' evaluated false"
time="2021-11-24T09:04:30.641Z" level=info msg="0/1 success conditions matched"
time="2021-11-24T09:04:30.641Z" level=info msg="Waiting for resource job.batch/pi-job-7467z in namespace personal-user resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."
time="2021-11-24T09:04:35.645Z" level=info msg="Get jobs 200"
time="2021-11-24T09:04:35.645Z" level=info msg="failure condition '{status.failed gt [3]}' evaluated false"
time="2021-11-24T09:04:35.645Z" level=info msg="success condition '{status.succeeded gt [0]}' evaluated false"
time="2021-11-24T09:04:35.645Z" level=info msg="0/1 success conditions matched"
time="2021-11-24T09:04:35.645Z" level=info msg="Waiting for resource job.batch/pi-job-7467z in namespace personal-user resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."
time="2021-11-24T09:04:40.646Z" level=info msg="Get jobs 200"
time="2021-11-24T09:04:40.646Z" level=info msg="failure condition '{status.failed gt [3]}' evaluated false"
time="2021-11-24T09:04:40.646Z" level=info msg="success condition '{status.succeeded gt [0]}' evaluated true"
time="2021-11-24T09:04:40.646Z" level=info msg="1/1 success conditions matched"
time="2021-11-24T09:04:40.646Z" level=info msg="Returning from successful wait for resource job.batch/pi-job-7467z in namespace personal-user"
time="2021-11-24T09:04:40.646Z" level=info msg="No output parameters"

# Logs from the workflow's wait container:
# There is no wait container

Finally I'd like to ask:

  1. Is there a reason this is not supported for resource templates, or is this simply an omission?
  2. Is there any work-around I can do to make resource templates work with Istio sidecars?
  3. Are you planning on fixing this? Do you have an ETA?

sarabala1979 commented 2 years ago

@johnbuluba I think that for the resource template, the controller does not inject the wait container into the pod, so there is no sidecar kill.
This was never supported; it is a new use case. Would you like to submit a PR for this? The Argo core team may not have Istio knowledge.

This code should be added at the end of this function: wfExecutor.KillSidecars(ctx)

https://github.com/argoproj/argo-workflows/blob/312f6463b47588a99050533911ab7e1a9c112136/cmd/argoexec/commands/resource.go#L66
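
The suggested change amounts to having the resource command ask the executor to stop the sidecars once the resource step has returned. A minimal sketch of that control flow follows, under stated assumptions: sidecarKiller, runResource, countingExecutor and simulate are hypothetical names for illustration, and only the method name KillSidecars comes from this thread. Calling the cleanup even when the step fails is a choice made in this sketch, not something the thread confirms.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sidecarKiller is a hypothetical stand-in for the workflow executor.
// Only the method name KillSidecars is taken from the thread.
type sidecarKiller interface {
	KillSidecars(ctx context.Context) error
}

// runResource runs the resource step and then always asks the executor to
// stop the sidecars, whether the step succeeded or not. The step's error
// takes precedence over a cleanup error.
func runResource(ctx context.Context, exec sidecarKiller, step func(context.Context) error) error {
	stepErr := step(ctx)
	killErr := exec.KillSidecars(ctx)
	if stepErr != nil {
		return stepErr
	}
	return killErr
}

// countingExecutor records how often KillSidecars was called.
type countingExecutor struct{ kills int }

func (c *countingExecutor) KillSidecars(context.Context) error {
	c.kills++
	return nil
}

// simulate runs one resource step and reports the number of kill calls
// and whether runResource returned an error.
func simulate(stepFails bool) (kills int, failed bool) {
	exec := &countingExecutor{}
	err := runResource(context.Background(), exec, func(context.Context) error {
		if stepFails {
			return errors.New("resource step failed")
		}
		return nil
	})
	return exec.kills, err != nil
}

func main() {
	fmt.Println(simulate(false)) // step succeeds, sidecars still signalled once
	fmt.Println(simulate(true))  // step fails, sidecars still signalled once
}
```

Running the cleanup on both paths keeps a failed resource step from leaving the pod in the Running state for the same reason a successful one does in this report.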

johnbuluba commented 2 years ago

@sarabala1979 Thanks a lot for your response!

This was never supported; it is a new use case.

Looking at the docs and the related issue #1282, there is nothing indicating that resource templates would not work. Is there a doc or issue that documents what will work and what will not?

This code should be added at the end of this function: wfExecutor.KillSidecars(ctx)

This didn't solve the issue. Adding this line resulted in the main container exiting with the error:

"executor error: failed to get container name for process 26: open /proc/26/environ: permission denied"

To solve this I had to also add the SYS_PTRACE capability to the main container by editing this function https://github.com/argoproj/argo-workflows/blob/312f6463b47588a99050533911ab7e1a9c112136/workflow/controller/operator.go#L2829 and adding the following snippet:

    // Add the required capabilities to be able to kill sidecars
    if woc.getContainerRuntimeExecutor() == common.ContainerRuntimeExecutorPNS {
        mainCtr.SecurityContext = &apiv1.SecurityContext{
            Capabilities: &apiv1.Capabilities{
                Add: []apiv1.Capability{
                    // necessary to access the sidecars' environment
                    apiv1.Capability("SYS_PTRACE"),
                },
            },
        }
    }
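
Note that the snippet above assigns a fresh SecurityContext, which would replace any security context already set on the main container. A minimal sketch of a merge that keeps existing settings is shown here. To stay self-contained it uses simplified local copies of the Kubernetes core/v1 types (an assumption for illustration; real code would use the types from k8s.io/api/core/v1), and addCapability is a hypothetical helper, not an existing Argo function.

```go
package main

import "fmt"

// Capability, Capabilities and SecurityContext are simplified local copies
// of the Kubernetes core/v1 types of the same names, reduced to the fields
// this example needs.
type Capability string

type Capabilities struct {
	Add []Capability
}

type SecurityContext struct {
	RunAsUser    *int64
	Capabilities *Capabilities
}

// addCapability returns a security context that includes c in its Add list.
// It reuses sc when it is non-nil, so existing fields are preserved, and it
// does not add the same capability twice.
func addCapability(sc *SecurityContext, c Capability) *SecurityContext {
	if sc == nil {
		sc = &SecurityContext{}
	}
	if sc.Capabilities == nil {
		sc.Capabilities = &Capabilities{}
	}
	for _, existing := range sc.Capabilities.Add {
		if existing == c {
			return sc
		}
	}
	sc.Capabilities.Add = append(sc.Capabilities.Add, c)
	return sc
}

func main() {
	uid := int64(8737)
	// An existing context with RunAsUser set keeps that setting.
	sc := addCapability(&SecurityContext{RunAsUser: &uid}, "SYS_PTRACE")
	fmt.Println(*sc.RunAsUser, sc.Capabilities.Add)
}
```

With a helper like this, the PNS branch could call it on the main container's existing context instead of overwriting it.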

If this solution is OK, I would be very happy to submit a PR.

alexec commented 2 years ago

I believe this should work. Could you please try the Emissary executor in v3.2?

licheng5625 commented 2 years ago

I believe this should work. Could you please try the Emissary executor in v3.2?

Still does not work with version workflow-controller:v3.3.5.

videlov commented 1 year ago

The issue still exists with Argo Workflows v3.4.9 and Istio 1.18.2. The workflow just hangs, and the workflow step pod stays up with the istio-proxy container still running.

Easily reproducible with:

kubectl create namespace argo-istio
kubectl label namespace argo-istio istio-injection=enabled --overwrite
argo submit -n argo-istio --watch https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/steps.yaml