argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.89k stars 3.17k forks

Workflows with istio (or other service mesh) #1282

Closed dicarlo2 closed 3 years ago

dicarlo2 commented 5 years ago

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened: Argo attempted to store the docker logs in s3 after killing the istio sidecar. This made it impossible to connect to s3, since all traffic is iptables-redirected to the (now killed) istio proxy.

What you expected to happen: The "wait" container leaves the sidecars (or perhaps designated ones) alive until artifact (logs, etc.) storage is complete.

How to reproduce it (as minimally and precisely as possible):

1. Configure a cluster with istio.
2. Configure a workflow with a template that has an istio sidecar. We did this manually rather than injecting, since argo doesn't recognize the injected container as a sidecar that needs killing.
3. Specify s3 configuration and archiveLogs in the argo configuration (a sketch of this configuration follows the list).
4. Run the workflow.
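For step 3, a minimal sketch of the relevant workflow-controller ConfigMap settings, assuming an S3-compatible artifact repository (bucket, endpoint, and secret names below are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  artifactRepository: |
    archiveLogs: true
    s3:
      bucket: my-artifact-bucket       # placeholder
      endpoint: s3.amazonaws.com       # placeholder
      accessKeySecret:
        name: my-s3-credentials        # placeholder secret
        key: accessKey
      secretKeySecret:
        name: my-s3-credentials        # placeholder secret
        key: secretKey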

ddseapy commented 4 years ago

@dicarlo2 Were you ever able to figure out a way to get an istio-injected workflow working with artifact support?

ddseapy commented 4 years ago

Maybe configuring istio to avoid intercepting s3 traffic with traffic.sidecar.istio.io/excludeOutboundIPRanges?

ddseapy commented 4 years ago

As you suggested, delaying killing sidecars in waitContainer() works. I also had to work around https://github.com/istio/istio/issues/9454, but that didn't require changes to argo or istio code.

Unsure if this kind of change is something argo maintainers want as a PR, or if there are downsides to doing this.

func waitContainer() error {
    wfExecutor := initExecutor()
    defer wfExecutor.HandleError()
    defer stats.LogStats()
    stats.StartStatsTicker(5 * time.Minute)

    // Wait for main container to complete
    err := wfExecutor.Wait()
    if err != nil {
        wfExecutor.AddError(err)
        // do not return here so we can still try to kill sidecars & save outputs
    }
    // Saving logs
    logArt, err := wfExecutor.SaveLogs()
    if err != nil {
        wfExecutor.AddError(err)
        // do not return here so we can still try to kill sidecars
    }
    // Saving output parameters
    err = wfExecutor.SaveParameters()
    if err != nil {
        wfExecutor.AddError(err)
        // do not return here so we can still try to kill sidecars
    }
    // Saving output artifacts
    err = wfExecutor.SaveArtifacts()
    if err != nil {
        wfExecutor.AddError(err)
        // do not return here so we can still try to kill sidecars
    }
    // Killing sidecar containers
    err = wfExecutor.KillSidecars()
    if err != nil {
        wfExecutor.AddError(err)
        return err
    }
    // Capture output script result
    err = wfExecutor.CaptureScriptResult()
    if err != nil {
        wfExecutor.AddError(err)
        return err
    }
    err = wfExecutor.AnnotateOutputs(logArt)
    if err != nil {
        wfExecutor.AddError(err)
        return err
    }
    return nil
}
sarabala1979 commented 4 years ago

Please go ahead and create a PR with the changes. We will review and merge. Thanks for contributing.

nfickas commented 4 years ago

@sarabala1979 Is there any way to get an Argo Workflow to recognize an automatically injected istio sidecar?

ddseapy commented 4 years ago

@nfickas is there an issue you are running into with istio sidecars, or what do you mean by "recognize"? We have been using 2.4.2 with istio sidecars automatically injected without any issues.

ddseapy commented 4 years ago

For us this issue was resolved here: https://github.com/argoproj/argo/pull/1645

nfickas commented 4 years ago

@ddseapy our workflows start up with the automatic sidecars but then the sidecars are stuck in a running state after the workflow completes.

nfickas commented 4 years ago

I am seeing some odd behavior on the wait container; it looks like it errored out: Failed to establish pod watch: Get https://10.100.200.1:443/api/v1/namespaces/default/pods?fieldSelector=metadata.name%3Dhello-world-82bm7&watch=true: dial tcp 10.100.200.1:443: connect: connection refused

ddseapy commented 4 years ago

That's the istio issue I linked above. The underlying issue is: https://github.com/istio/istio/issues/11130

Really it's because the k8s sidecar support enhancement (https://github.com/kubernetes/enhancements/issues/753) hasn't landed, so there is no container startup ordering, which causes a race condition.

You can exclude the k8s api server from the IP range that envoy intercepts, or do things like change the docker command. It could also be fixed by retrying the request on failure, with code changes in argo.

nfickas commented 4 years ago

I would rather not make any changes to argo code. Of the other two options, which one would be easier, and do you have any examples of either being done?

Just curious did you run into this issue when getting argo working with injected sidecars?

jaygorrell commented 4 years ago

The easiest thing to do is to just annotate your Argo workflows so that the auto-injector doesn't add a sidecar to them. Better support is coming in k8s (maybe in 1.8) but it's just not there yet.

Jobs in k8s are the same way... there isn't really anything special about Argo there.

Edit: This would be spec.templates[].metadata.annotations: sidecar.istio.io/inject: "false"
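For illustration, a minimal Workflow sketch with that template-level annotation (the template name and image are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: no-sidecar-
spec:
  entrypoint: main
  templates:
  - name: main
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # skip istio auto-injection for this pod
    container:
      image: alpine:3.12                   # placeholder image
      command: [echo, hello]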

ddseapy commented 4 years ago

If you don't need the sidecar, that is indeed easiest. We need to control traffic for our applications, so that is not an option for us. For the workflows we run, it was sufficient to exclude the k8s master IP range via traffic.sidecar.istio.io/excludeOutboundIPRanges.
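A sketch of what that can look like on a workflow template, keeping injection enabled while excluding the API server address (the CIDR below is a placeholder; substitute your cluster's API server IP):

  templates:
  - name: main
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.100.200.1/32   # placeholder: your k8s API server IP/CIDR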

nfickas commented 4 years ago

We need sidecars as well, I will try that out. Thanks!

jaygorrell commented 4 years ago

That's a good point, too. But just to be clear, I believe that will make the pod stay up when the workflow is finished - correct? You may need a separate system to clean those up. I'm going off memory and could be mistaken.

ddseapy commented 4 years ago

To state the obvious, requests to k8s api server will then bypass istio. This was not a security concern for our use-cases, but it could be for someone else.

ddseapy commented 4 years ago

@jaygorrell that's why I put up the PR. We have been using 2.4.2 for quite a while now without issues, and the pod goes to the Completed state.

ddseapy commented 4 years ago

For visibility - Prasanna Kumar Krishnamurthy in argo workflow slack suggested patching the pod spec (https://github.com/argoproj/argo/blob/master/examples/pod-spec-patch.yaml) with something like

"containers":[{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 20; argoexec wait"]}]
nfickas commented 4 years ago

@ddseapy So the above suggestion will just make it clean up the istio-proxy container after 20 seconds. I think the real issue is that the wait and main containers are launching before the proxy starts up.

ddseapy commented 4 years ago

My understanding is the suggestion should delay the wait container from running by editing its command. I would think that it would delay cleanup (as you said), but also prevent the startup race condition.

If the main container has the same race condition (depends on your main container), you would need to wait there too.

I haven't tried the approach myself yet though, so I may be misunderstanding things.

nfickas commented 4 years ago

Hm, yeah, I am not seeing that happen; I am still seeing the main and wait containers start up before istio-proxy. For reference, this is what I did: podSpecPatch: '{"containers": [{"name":"wait", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec wait"]}, {"name":"main", "command": ["/bin/bash", "-c"], "args": ["sleep 10; argoexec main"]}]}'

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bobgy commented 4 years ago

I'd still want this open. This is valuable for kubeflow pipelines multi-tenant support, because the api server needs to be protected by istio. Having istio sidecars in workflows allows in-cluster api calls from workflows.

shadiramadan commented 4 years ago

The startup problem should be resolved in Istio 1.7.0 as a workaround: https://istio.io/latest/news/releases/1.7.x/announcing-1.7/#production-operability-improvements https://github.com/istio/istio/pull/24737

If you are using the IstioOperator:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        holdApplicationUntilProxyStarts: true
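In Istio 1.8 and later, the same flag can reportedly also be set per pod rather than mesh-wide, via the proxy.istio.io/config annotation; a minimal sketch on a workflow template:

  templates:
  - name: main
    metadata:
      annotations:
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'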
BlackRider97 commented 3 years ago

Anyone found a workaround for Linkerd2?

arnoldrw commented 3 years ago

Recently we ran into not one, but three separate race conditions while working with argo in a test environment with istio enabled.

I am posting this in the hope that it helps others, as certain aspects of this are covered in this issue but through replies that can be hard to follow. In addition, I believe case 3 may be new to this issue, and case 1 has a potential change from previously posted solutions.

Relevant details of our istio environment:

- we have all traffic in/out through istio to control what can be hit
- we have mtls enabled
- not on 1.8 yet, so we can't use the per-pod holdApplicationUntilProxyStarts istio annotation

We defined a simple workflow that simply curled a host that had been whitelisted by istio. The workflow always contained this annotation:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"

We were seeing race conditions in 3 areas

  1. The actual workflow job. If the istio proxy was not yet up, the curl would fail
  2. The argo wait container. This caused a cascading failure
  3. Writes to minio. These were seemingly randomly failing

Workflow job:

This one was the easiest. somesite.com had been whitelisted for egress and our initial workflow had a command and args like this:

command: [sh, -c]
args: ["date; curl -v somesite.com"]

This resulted in an error like this in the main logs

<timestamp>
curl: (7) Failed to connect to somesite.com port 443: Connection refused

Next I grabbed the proxy logs for the container

kubectl logs <container> istio-proxy

From those, I could see that the envoy proxy was not up at the timestamp of the curl call. A log entry is made that states "Envoy proxy is ready" once it is up, so you can just grep for that.

Workflow job fix

The fix for the workflow job was to curl to check whether the proxy was up, but with a modification to what has previously been posted. Other posts curl localhost:15000. What we found when doing that was that 15000 is the admin port, and a response from it did not mean the proxy was up. Instead we check the health status port:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl https://somesite.com"]

We added an extra sleep just in case - more testing needed there.

Argo wait container

This is similar to what nfickas mentioned with ddseapy's solution.

After we changed the workflow command we still faced issues. Specifically in the workflow ui we'd see a log about "Failed to establish pod watch: timed out waiting for the condition".

The important log for this was the wait log

kubectl logs <container> wait

This showed a log entry like this

time="2021-01-18T19:38:17.0651" level=error msg="executor error: Failed to establish pod watch: timed out waiting for the condition\ngithub.com/argoproj/argo/errors..........

The issue in this was also a race condition. Because all of our traffic was going through istio, the wait container needed istio to be up. Again, an inspection of the istio log to compare timestamps was the right next step

kubectl logs <container> istio-proxy | grep "Envoy proxy is ready"

This grep for the ready state of the proxy showed a timestamp that was seconds after the argo wait container tried to make its call.

Argo wait container fix

The fix for this was to exclude the k8s API IP range from istio interception by adding to our annotations:

metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

This ensured that the communication did not need istio to be up yet.

Minio issue

The last race condition we faced was on the write of logs to Minio (this is a test environment). This would occasionally fail and

kubectl logs <container> wait 

showed 503 errors when trying to connect to minio.

For this we needed to increase the logging level by modifying the command:

args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]

A further inspection of the istio log

kubectl logs <container> istio-proxy

Showed the following

2021-01-18T19:51:50.461773Z debug   envoy connection    [external/envoy/source/common/network/connection_impl.cc:727] [C17] connecting to <ip>:9000
2021-01-18T19:51:50.461858Z debug   envoy connection    [external/envoy/source/common/network/connection_impl.cc:736] [C17] connection in progress
2021-01-18T19:51:50.462871Z debug   envoy connection    [external/envoy/source/common/network/connection_impl.cc:592] [C17] connected
2021-01-18T19:51:50.462975Z debug   envoy connection    [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466813Z debug   envoy connection    [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.466837Z debug   envoy connection    [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:191] [C17] handshake expecting read
2021-01-18T19:51:50.467951Z debug   envoy connection    [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:198] [C17] handshake error: 1
2021-01-18T19:51:50.467977Z debug   envoy connection    [external/envoy/source/extensions/transport_sockets/tls/ssl_socket.cc:226] [C17] TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED

This issue is tied to istio not updating certificates, and appears to be an istio/envoy issue related to certificate handling. Some example issues are here and here.

Minio fix

A temporary fix for this is to disable mtls for minio with a PeerAuthentication and a DestinationRule, per the istio docs.

Minio tls exclusion:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: minio-mtls-disable
  namespace: test
spec:
  selector:
    matchLabels:
      app: minio
  mtls:
    mode: UNSET
  portLevelMtls:
    9000:
      mode: DISABLE
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: minio
  namespace: test
spec:
  host: minio
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 9000
      tls:
        mode: DISABLE

Full example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: race-workflow
spec:
  entrypoint: test
  templates:
  - name: test
    steps:
    - - name: run-it
        template: curler
  - name: curler
    activeDeadlineSeconds: 60
    retryStrategy: 
      limit: 2
      retryPolicy: "Always" #OnFailure or OnError also are options
    script:
      image: python3.8-slim:latest
      command: [sh, -c]
      args: ["curl https://somesite.com/; until curl --head localhost:15021/healthz/ready ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; sleep 2; curl -s -XPOST http://localhost:15000/logging?level=debug; curl https://somesite.com"]
      imagePullPolicy: IfNotPresent
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/excludeOutboundIPRanges: k8apiip/32

While istio 1.8 does include the holdApplicationUntilProxyStarts annotation, I do wonder whether it would solve case 2 with the wait container or not. It does seem like some way to force argo to wait until the proxy is up would be generally useful.

Hope this helps someone.

alexec commented 3 years ago

Workflows v3 should just work with ISTIO. Please let me know if that is not the case.

Bobgy commented 3 years ago

May I confirm which is up to date, the above statement or the documentation? https://argoproj.github.io/argo-workflows/sidecar-injection/

alexec commented 2 years ago

It should just work, but it doesn’t.

We’ve been speaking to ISTIO, because they have similar issues with other tools. Previously, our recommendation was to disable ISTIO.

We have since heard from users that the PNS executor works with ISTIO.
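For reference, a minimal sketch of how one might select the PNS executor, assuming a version that still reads the containerRuntimeExecutor key from the workflow-controller ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns   # use the Process Namespace Sharing executor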

alexec commented 2 years ago

We should fix #7057.

alexec commented 2 years ago

Aside: it is very hard to cater for injected sidecars when container ordering is important both to you and to the injected sidecar. I suspect that we'll always have problems working with injected sidecars like ISTIO and Vault until we have a general way to kill them via API. #7057 is just a patch for the problem, and we may need more than one of these.

nfickas commented 2 years ago

I'll add that when my last company started using the emissary executor, everything worked very well with istio. The only thing was having to find a way to kill the sidecars after the workflow ran, but I think we solved that by reordering the pod's startup somehow.

alexec commented 2 years ago

@nfickas can you share more? Do you have details we could share with other users?

nfickas commented 2 years ago

I can't remember all the implementation details and no longer have access to those repos. But the gist of it comes down to using istio's holdApplicationUntilProxyStarts configuration.

This will cause the workflows execution to wait until the sidecars are able to serve traffic.

If I'm remembering this correctly just adding that annotation magically fixed our issues.

We had originally followed something similar to what is described in this article with limited success: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

mattharrigan commented 2 years ago

Hello. First, thanks for creating argo. I think it's a great tool. I'm also struggling to use argo and istio; using istio is also a company policy for me. IMO killing the sidecar is easy, but it's almost impossible to get init containers working if they need network access. I would like to be able to use artifacts. My best thought with the current functionality is to have steps upload artifacts as normal, but when input artifacts are required, bypass argo completely and get the files straight from S3 in the main container. Moderate PITA, but might work. Can you recommend anything better?

Longer term I think it would be great if the artifacts could be downloaded by the wait container itself. It could expose some http server and reply 200 when that is complete. The main container could wait on that signal. Just a thought.

RobCannon commented 1 year ago

I just wanted to say that the new Istio Ambient Mesh (GA in Istio 1.18) seems to solve the issues with Istio and Argo Workflows since there is no longer a sidecar.