argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Multi-cluster, multi-namespace workflows #3523

Open alexec opened 3 years ago

alexec commented 3 years ago

Summary

Run workflows across multiple clusters.

Motivation

So that you only need to run one Argo Workflows installation, and so that a single workflow can have nodes in different clusters.

Proposal

Like Argo CD.

#3516


Message from the maintainers:

If you wish to see this enhancement implemented please add a πŸ‘ reaction to this issue! We often sort issues this way to know what to prioritize.

luozhaoyu commented 3 years ago

Beyond multi-cluster, shall I create another issue for multi-namespace support? This related issue, https://github.com/argoproj/argo/issues/2063#issuecomment-668211852, asks to install Argo Workflows in one namespace but support creating pods in multiple namespaces (not a cluster-scoped installation, as the permissions would be too broad).

CaramelMario commented 3 years ago

More details:

alexec commented 3 years ago

PoC findings:

What went well:

Questions raised:

alexec commented 3 years ago

I've created a dev build for people to test out multi-cluster workflows (and thereby prove demand for it):

argoproj/workflow-controller:multic

Instructions for use:

https://github.com/argoproj/argo/blob/399286fc1884bf20419de4b091766b29bbca7d94/docs/multi-cluster.md

Please let me know how you get on with this.

alexec commented 3 years ago

Please answer this poll: https://argoproj.slack.com/archives/C8J6SGN12/p1607041333397500

adrienjt commented 3 years ago

@alexec what do you think of this? https://admiralty.io/blog/2019/01/17/running-argo-workflows-across-multiple-kubernetes-clusters (the link was once listed on the Argo Workflows website)

The blog post is slightly outdated, as Admiralty uses Virtual Kubelet and the scheduler framework now, but the use case still works. Admiralty creates a node that represents a remote cluster, which makes multi-cluster workflows possible without any code change in the Argo project.

IMHO, multi-cluster is a common concern best treated separately. BTW, Admiralty also works with Argo CD.

alexec commented 3 years ago

Hi @adrienjt, thank you - I just tweeted the post's author before realizing it was you. I'm aware that any first-class solution in Argo would be in competition with a multi-cluster scheduler, as it would make the need moot. I'm also aware, from working on Argo CD, that security with multi-cluster is difficult, because you end up with a single main cluster that has a lot of permissions.

alexec commented 3 years ago

I've updated the dev images during my white-space time today. You can test with these images:

Instructions

We really need to hear more concrete use cases to progress this.

dudicoco commented 3 years ago

Isn't multi-namespace already supported? I assume this could be done by using a cluster-scoped installation, but instead of creating a cluster role, creating roles in each namespace you would like Argo to have access to.
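
For illustration, a minimal sketch of that per-namespace grant (the names, namespace, and verbs here are illustrative; the exact rules needed depend on your installation):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-pods
  namespace: team-a   # repeat one Role per namespace Argo should manage
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argo-workflows-pods
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argo-workflows-pods
subjects:
  - kind: ServiceAccount
    name: argo        # the workflow controller's service account
    namespace: argo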

shadiramadan commented 3 years ago

"We really need to hear more concrete use cases to progress this."

@alexec For background on our use case:

We have 4 environments, each a separate cluster.

One is an 'operations' cluster that has argo-workflows installed. The rest are dev, staging, and production.

We have a workflow that updates multiple data stores with a lot of data.

Instead of running 3 Argo installations/UIs, or exposing endpoints to the data stores so they can be reached by the operations cluster's workflows, I'd rather be able to run a workflow pod in a different cluster than the one Argo is installed in, so I can have one UI/login for all my workflows running across multiple clusters.

Right now we have to expose all these data stores and copy many Kubernetes secrets from the dev/staging/production clusters to the operations cluster to make everything work. I'd rather just run a container in any connected cluster I specify.

roshpr commented 3 years ago

@alexec our use case is as follows:

We have a central master cluster which needs to connect to multiple regional and edge Kubernetes clusters to run different workflows, depending on which workflows are provisioned in our central Argo server.

Right now we work around this by using Git runners on each regional cluster to run some of our tasks. It is a cumbersome solution that is difficult to maintain, and it is hard to organize the sequence of tasks.

alexec commented 3 years ago

There are two main interpretations of "multi-cluster":

  1. πŸš€ A single installation of Argo Workflows in one cluster that runs workflows created in other clusters, and exposes a single user interface for all workflows in all clusters (#4684). It is possible to run workflows in multiple clusters today with multiple installations, so this is about simplifying management of Argo Workflows when you have many clusters.
  2. πŸ‘€ One or more installations that can run workflows where a single workflow can have two steps running in different namespaces (#2063?) and/or different clusters. This is not possible today, so this is about opening up new use cases.

As this is ambiguous, we don't actually know which of these you want (or both).

Can I ask you to vote by adding the appropriate reaction (πŸš€ / πŸ‘€) to this comment? Go further and demonstrate your interest by adding a comment with the use case you're trying to solve.
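
For a concrete picture of interpretation 2, a hypothetical sketch of what the PoC syntax looks like (the cluster/namespace fields and all names here reflect the dev build linked earlier in this thread, not a released API):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: multi-cluster-
spec:
  entrypoint: main
  templates:
    - name: main
      cluster: cluster-1   # hypothetical: run this step's pod in another cluster
      namespace: remote    # ...and in another namespace
      container:
        image: argoproj/argosay:v2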

joyciep commented 3 years ago

Our use case for option 2:

Our workflow involves different steps in different clusters: the first few steps extract and preprocess data in the first cluster, then the next step trains on the data in a separate cluster (with GPUs) for machine learning purposes.

joshuajorel commented 3 years ago

@alexec point 2 might be more extensible in terms of scaling, i.e. deploying workflow controllers in different namespaces and/or clusters that communicate with a single Argo server. It might also open up the possibility of having workflow controllers outside Kubernetes (VM deployments), since we might not want specialized hardware such as GPU machines to be part of a cluster.

servo1x commented 3 years ago

Where option 2 might be nice is where there is a secondary cluster for Windows nodes. Our primary (Linux) cluster uses a CNI that is not compatible with Windows, so we had to set up a separate cluster. It would be nice if the Argo Workflows installation on our primary cluster had the capability to schedule workloads on the secondary cluster for Windows-specific tasks.

Imagine someone using Argo Workflows for CI in a monorepo with both Linux and Windows Docker images. Instead of having separate workflows, a single one with tasks that could be scheduled on the correct cluster could open up a lot of interesting possibilities.

Guillermogsjc commented 3 years ago

Point 1 is the straightforward use case where you have several clients and cloud accounts with distinct clusters.

Managing workflows (one UI and a single client), especially cron ones, from an Argo installation on a "central" cluster would simplify the work a lot. There might be several Argo installations (one per cluster), but having a main one that abstracts over the rest with credentials may be easier than trying to do away with the existing Argos in the subordinate clusters.

awcchungster commented 3 years ago

Our use case for option 2 (same as @joyciep's above):

Our workflow involves different steps in different clusters: the first few steps extract and preprocess data in the first cluster, then the next step trains on the data in a separate cluster (with GPUs) for machine learning purposes.

From the machine learning perspective, this use case is increasingly popular. At AWS, I meet with many customers who are hybrid or multi-cloud. The ability to run steps that transfer data, run containers in different clusters, merge final results, and manage all steps in a single interface is highly valuable.

@srivathsanvc

dllehr81 commented 2 years ago

To add to what's been said about clusters having different hardware: we have a use case where the clusters are different architectures as well, ppc64le vs. x86_64, but it could be any pair. We need to build packages on both and publish to a single place.

Due to the nature of these packages, cross-compilers aren't an option, so we maintain two separate OpenShift clusters with their own Argo instances, etc. It would be nice for a Workflow to be able to schedule/track across clusters so we know when both sides have finished.

elidhu commented 2 years ago

Just a comment: can't the cases with different architectures / hardware requirements be handled with a nodeSelector? You don't need two separate clusters for this; you just need those nodes in the same cluster with appropriate labels.
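
As a sketch of that suggestion, assuming the nodes carry the standard kubernetes.io/arch label (the template and image names here are illustrative):

templates:
  - name: build-ppc64le
    nodeSelector:
      kubernetes.io/arch: ppc64le   # pin this step to ppc64le nodes in the same cluster
    container:
      image: registry.example.com/builder:latest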

caelan-io commented 2 years ago

Use cases for multi-cluster workflows that we have observed recently:

A. πŸ˜„ Automating a workflow that uses a variety of processing resources (e.g., both CPU and GPU at different steps, or specific AWS, GCP, or Azure features at different steps)

B. πŸ‘€ Running and re-running a workflow across separate clusters that hold different client/customer data

C. πŸš€ Running an extremely large workflow that requires sharding the workload across multiple clusters in order to complete the job and avoid hitting resource limits

D. πŸŽ‰ Automating a workflow that executes steps distributed across multiple cloud regions/data centers, e.g. complying with GDPR-type restrictions on where data must be stored, and running workflows with those datasets spread across clusters in different regions

I'm curious if others are seeing these use cases. Perhaps upvote with the corresponding emoji if so!

mbruner commented 2 years ago

Our use case where multi-cluster might simplify the architecture: we have a large number of simultaneously running workflows (tens of thousands). We don't fit into one region, so we plan to run Kubernetes clusters across two or three regions, and we have a few options for how to handle balancing:

If the workflow controller can handle this, it will greatly simplify the cluster topology, though the controller will be much more heavily loaded, because it must handle all the workflows in one process.

One of the major issues we have with all of the listed setups is how to handle backpressure, especially when different regions have different capacity.

alexec commented 2 years ago

A multi-cluster PoC is ready for testing.

Guillermogsjc commented 2 years ago

This is an absolutely, madly powerful and great solution.

shuker85 commented 2 years ago

Some notes with the latest tag: I had to grant secret-access rights to the argo-workflow-controller service account, like so:

kubectl create role access-secrets --verb=get,list,watch,update,create --resource=secrets -n argo
kubectl create rolebinding --role=access-secrets default-to-secrets --serviceaccount=argo:argo-workflow-controller -n argo

Previously, the log showed:

msg="failed to get kubeconfig secret: secrets \"kubeconfig\" is forbidden: User \"system:serviceaccount:argo:argo-workflow-controller\" cannot get resource \"secrets\" in API group \"\ β”‚
β”‚ " in the namespace \"argo\""

In the kubeconfig secret itself, I'm trying to connect to an AWS EKS cluster via the aws command (not sure it is supposed to work though):

apiVersion: v1
clusters:
  - cluster:
      certificate-authority-data: aaaaaa==...
      server: https://AAAAF.yl4.us-east-2.eks.amazonaws.com
    name: arn:aws:eks:us-east-2:XYZ:cluster/test
contexts:
  - context:
      cluster: arn:aws:eks:us-east-2:XYZ:cluster/test
      user: arn:aws:eks:us-east-2:XYZ:cluster/test
    name: arn:aws:eks:us-east-2:XYZ:cluster/test
kind: Config
preferences: {}
users:
  - name: arn:aws:eks:us-east-2:XYZ:cluster/test
    user:
      exec:
        apiVersion: client.authentication.k8s.io/v1alpha1
        args:
          - --region
          - us-east-2
          - eks
          - get-token
          - --cluster-name
          - test
        command: aws
        env:
          - name: AWS_ACCESS_KEY_ID
            value: 121212
          - name: AWS_SECRET_ACCESS_KEY
            value: 1344444

Error:


time="2021-08-28T09:45:47Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2021-08-28T09:45:47Z" level=info msg="cron config" cronSyncPeriod=10s
time="2021-08-28T09:45:47.882Z" level=info msg="not enabling pprof debug endpoints"
time="2021-08-28T09:45:47.914Z" level=info msg="Get secrets 200"
E0828 09:45:47.915537       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1d49520, 0x2fd6c40)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc00014dc40, 0x22bc568, 0xc0003040a0, 0x22ea6a8, 0xc0005f5340, 0x2283ae0, 0xc00051f4b0, 0xc00014a9e0, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001dfb80, 0xc00011e780, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001dfb80, 0xc00004c0a0, 0x8, 0x8, 0xc0001dfb80, 0xc00004c0a0)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001dfb80, 0xc00006c778, 0xc00010ff78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b
E0828 09:45:47.915583       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1d49520, 0x2fd6c40)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:51 +0xcb
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc00014dc40, 0x22bc568, 0xc0003040a0, 0x22ea6a8, 0xc0005f5340, 0x2283ae0, 0xc00051f4b0, 0xc00014a9e0, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001dfb80, 0xc00011e780, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001dfb80, 0xc00004c0a0, 0x8, 0x8, 0xc0001dfb80, 0xc00004c0a0)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001dfb80, 0xc00006c778, 0xc00010ff78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1b375ee]
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc00014dc40, 0x22bc568, 0xc0003040a0, 0x22ea6a8, 0xc0005f5340, 0x2283ae0, 0xc00051f4b0, 0xc00014a9e0, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001dfb80, 0xc00011e780, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001dfb80, 0xc00004c0a0, 0x8, 0x8, 0xc0001dfb80, 0xc00004c0a0)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001dfb80, 0xc00006c778, 0xc00010ff78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b

The workflow controller ConfigMap:

apiVersion: v1
data:
  cluster: main
  config: "containerRuntimeExecutor: emissary\nartifactRepository:\n  s3:\n    accessKeySecret:\n
    \     key: accesskey\n      name: \n    secretKeySecret:\n      key: secretkey\n
    \     name: \n    bucket: \n    endpoint: \n    insecure: true\nsso:\n  clientId:\n
    \   key: client-id\n    name: argo-workflows-sso\n  clientSecret:\n    key: client-secret\n
    \   name: argo-workflows-sso\n  issuer: https://argo-cd.dev.example.com/api/dex\n
    \ redirectUrl: https://argo-wf.dev.example.com/oauth2/callback\n  scopes:\n
    \ - groups\n  - email\n  - openid\n  sessionExpiry: 240h\n"
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config":"containerRuntimeExecutor: emissary\nartifactRepository:\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: \n    secretKeySecret:\n      key: secretkey\n      name: \n    bucket: \n    endpoint: \n    insecure: true\nsso:\n  clientId:\n    key: client-id\n    name: argo-workflows-sso\n  clientSecret:\n    key: client-secret\n    name: argo-workflows-sso\n  issuer: https://argo-cd.dev.example.com/api/dex\n  redirectUrl: https://argo-wf.dev.example.com/oauth2/callback\n  scopes:\n  - groups\n  - email\n  - openid\n  sessionExpiry: 240h\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"workflow-controller","app.kubernetes.io/instance":"argo-wf","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"argo-workflows-cm","app.kubernetes.io/part-of":"argo-workflows","argocd.argoproj.io/instance":"argo-wf","helm.sh/chart":"argo-workflows-0.5.0"},"name":"argo-workflow-controller-configmap","namespace":"argo"}}
  creationTimestamp: "2021-08-20T14:37:02Z"
  labels:
    app.kubernetes.io/component: workflow-controller
    app.kubernetes.io/instance: argo-wf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: argo-workflows-cm
    app.kubernetes.io/part-of: argo-workflows
    argocd.argoproj.io/instance: argo-wf
    helm.sh/chart: argo-workflows-0.5.0
  name: argo-workflow-controller-configmap
  namespace: argo
  resourceVersion: "7460842"
  uid: 20b10f52-007e-4e39-aa6f-e2472ec24883
alexec commented 2 years ago

@shuker85 your config requires the "aws" binary to be installed on the workflow controller image. I think you can mount a volume with the binary on it and set the PATH env var to point to it.
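
A minimal sketch of that suggestion, assuming something (e.g. an init container) has already placed the binary on the volume; the volume name and paths are illustrative:

containers:
  - name: controller
    env:
      - name: PATH   # prepend the mount dir so the exec plugin can find "aws"
        value: /custom-tools:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    volumeMounts:
      - name: aws-bin
        mountPath: /custom-tools
volumes:
  - name: aws-bin
    emptyDir: {}     # populate with the aws binary, e.g. via an initContainer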

shuker85 commented 2 years ago

@alexec any hints on how to accomplish that?

shuker85 commented 2 years ago

Hi @alexec, I've tried to use an initContainer in order to get the aws binary:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "9"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"workflow-controller","app.kubernetes.io/instance":"argo-wf","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"argo-workflows-workflow-controller","app.kubernetes.io/part-of":"argo-workflows","app.kubernetes.io/version":"v0.0.0-dev-mc-0","argocd.argoproj.io/instance":"argo-wf","helm.sh/chart":"argo-workflows-0.5.0"},"name":"argo-workflow-controller","namespace":"argo"},"spec":{"replicas":2,"selector":{"matchLabels":{"app.kubernetes.io/instance":"argo-wf","app.kubernetes.io/name":"argo-workflows-workflow-controller"}},"template":{"metadata":{"labels":{"app.kubernetes.io/component":"workflow-controller","app.kubernetes.io/instance":"argo-wf","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"argo-workflows-workflow-controller","app.kubernetes.io/part-of":"argo-workflows","app.kubernetes.io/version":"v0.0.0-dev-mc-0","helm.sh/chart":"argo-workflows-0.5.0"}},"spec":{"affinity":{"podAntiAffinity":{"preferredDuringSchedulingIgnoredDuringExecution":[{"podAffinityTerm":{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/name","operator":"In","values":["argo-workflows-workflow-controller"]}]},"topologyKey":"failure-domain.beta.kubernetes.io/zone"},"weight":100}]}},"containers":[{"args":["--configmap","argo-workflow-controller-configmap","--executor-image","quay.io/argoproj/argoexec:v0.0.0-dev-mc-0","--loglevel","info","--gloglevel","0"],"command":["workflow-controller"],"env":[{"name":"ARGO_NAMESPACE","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.namespace"}}},{"name":"LEADER_ELECTION_IDENTITY","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"metadata.name"}}}],"image":"quay.io/argoproj/workflow-controller:v0.0.0-dev-mc-0","imagePullPolicy":"Always","livenessProbe":{"failureThreshold":3,"httpGet":{"path":"/healthz","port":6060},"initialDelaySeconds":90,"periodSeconds":60,"timeoutSeconds":30},"name":"controller","ports":[{"containerPort":9090,"name":"metrics"},{"containerPort":6060}],"resources":{},"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]},"readOnlyRootFilesystem":true,"runAsNonRoot":true}}],"nodeSelector":{"kubernetes.io/os":"linux"},"serviceAccountName":"argo-workflow-controller"}}}}
  creationTimestamp: "2021-08-20T14:37:04Z"
  generation: 9
  labels:
    app.kubernetes.io/component: workflow-controller
    app.kubernetes.io/instance: argo-wf
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: argo-workflows-workflow-controller
    app.kubernetes.io/part-of: argo-workflows
    app.kubernetes.io/version: v0.0.0-dev-mc-0
    argocd.argoproj.io/instance: argo-wf
    helm.sh/chart: argo-workflows-0.5.0
  name: argo-workflow-controller
  namespace: argo
  resourceVersion: "8307045"
  uid: e043a62a-b7d7-4c98-acd4-1103d58881fa
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: argo-wf
      app.kubernetes.io/name: argo-workflows-workflow-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: workflow-controller
        app.kubernetes.io/instance: argo-wf
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: argo-workflows-workflow-controller
        app.kubernetes.io/part-of: argo-workflows
        app.kubernetes.io/version: v0.0.0-dev-mc-0
        helm.sh/chart: argo-workflows-0.5.0
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app.kubernetes.io/name
                      operator: In
                      values:
                        - argo-workflows-workflow-controller
                topologyKey: failure-domain.beta.kubernetes.io/zone
              weight: 100
      volumes:
        - name: aws-bin
          emptyDir: {}
      initContainers:
        - name: instal-aws-bin
          image: registry.opensuse.org/opensuse/tumbleweed:latest
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -x
              echo "Installing AWS-CLI...";
              zypper -n in curl unzip which
              curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
              unzip -q awscliv2.zip
              ./aws/install
              cp $(which -a aws) /custom-tools/aws
              echo "Done.";
          volumeMounts:
            - mountPath: /custom-tools
              name: aws-bin
      containers:
        - args:
            - --configmap
            - argo-workflow-controller-configmap
            - --executor-image
            - quay.io/argoproj/argoexec:v0.0.0-dev-mc-0
            - --loglevel
            - debug
            - --gloglevel
            - "9"
          command:
            - workflow-controller
          env:
            - name: ARGO_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: LEADER_ELECTION_IDENTITY
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
          image: quay.io/argoproj/workflow-controller:v0.0.0-dev-mc-0
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: 6060
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 60
            successThreshold: 1
            timeoutSeconds: 30
          name: controller
          volumeMounts:
            - mountPath: /usr/bin/aws
              name: aws-bin
              subPath: aws
          ports:
            - containerPort: 9090
              name: metrics
              protocol: TCP
            - containerPort: 6060
              protocol: TCP
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            runAsNonRoot: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: argo-workflow-controller
      serviceAccountName: argo-workflow-controller
      terminationGracePeriodSeconds: 30

InitContainer log:


Installing AWS-CLI...
+ echo 'Installing AWS-CLI...'
+ zypper -n -q in curl unzip which
The following 3 NEW packages are going to be installed:
  curl unzip which
3 new packages to install.
Overall download size: 554.2 KiB. Already cached: 0 B. After the operation, additional 1.0 MiB will be used.
Continue? [y/n/v/...? shows all options] (y): y
+ curl https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o awscliv2.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 42.2M  100 42.2M    0     0  85.9M      0 --:--:-- --:--:-- --:--:-- 86.0M
+ unzip -q awscliv2.zip
+ ./aws/install
You can now run: /usr/local/bin/aws --version
++ which -a aws
+ cp /usr/local/bin/aws /custom-tools/aws
+ echo Done.
Done.

WF-controller logs:


time="2021-08-29T19:33:58Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2021-08-29T19:33:58Z" level=info msg="cron config" cronSyncPeriod=10s
time="2021-08-29T19:33:58.047Z" level=info msg="not enabling pprof debug endpoints"
I0829 19:33:58.048197       1 merged_client_builder.go:121] Using in-cluster configuration
I0829 19:33:58.048476       1 merged_client_builder.go:163] Using in-cluster namespace
I0829 19:33:58.049316       1 round_trippers.go:425] curl -k -v -XGET  -H "User-Agent: workflow-controller/v0.0.0 (linux/amd64) kubernetes/$Format/argo-workflows/v0.0.0-dev-mc-0 argo-controller" -H "Authorization: Bearer <masked>" -H "Accept: application/json, */*" 'https://172.16.16.1:443/api/v1/namespaces/argo/secrets/kubeconfig'
time="2021-08-29T19:33:58.063Z" level=info msg="Get secrets 200"
I0829 19:33:58.063510       1 round_trippers.go:445] GET https://172.16.16.1:443/api/v1/namespaces/argo/secrets/kubeconfig 200 OK in 14 milliseconds
I0829 19:33:58.063521       1 round_trippers.go:451] Response Headers:
I0829 19:33:58.063526       1 round_trippers.go:454]     Cache-Control: no-cache, private
I0829 19:33:58.063530       1 round_trippers.go:454]     Content-Type: application/json
I0829 19:33:58.063534       1 round_trippers.go:454]     X-Kubernetes-Pf-Flowschema-Uid: 1664c2d8-01d8-48ff-9f10-21d83f7749e2
I0829 19:33:58.063538       1 round_trippers.go:454]     X-Kubernetes-Pf-Prioritylevel-Uid: 5b41ea25-f82c-4db8-8ee6-84695e8c001f
I0829 19:33:58.063543       1 round_trippers.go:454]     Content-Length: 3600
I0829 19:33:58.063547       1 round_trippers.go:454]     Date: Sun, 29 Aug 2021 19:33:58 GMT
I0829 19:33:58.063552       1 round_trippers.go:454]     Audit-Id: 731d63f6-44a8-41ac-a070-4b8f48c731ed
I0829 19:33:58.063628       1 request.go:1107] Response Body: {"kind":"Secret","apiVersion":"v1","metadata":{"name":"kubeconfig","namespace":"argo","uid":"ad32fa6f-b2cc-43a7-b253-da758e2d16ea","resourceVersion":"7450787","creationTimestamp":"2021-08-28T09:01:44Z","managedFields":[{"manager":"kubectl-create","operation":"Update","apiVersion":"v1","time":"2021-08-28T09:01:44Z","fieldsType":"FieldsV1","fieldsV1":{"f:data":{".":{},"f:value":{}},"f:type":{}}}]},"data":{"value":"11111111mRxdjlZYzZpRURtWWxRSlAwSlRXa0w5Cg=="},"type":"Opaque"}
E0829 19:33:58.065511       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1d49520, 0x2fd6c40)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:48 +0x86
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc0001419c0, 0x22bc568, 0xc00055e000, 0x22ea6a8, 0xc0003946e0, 0x2283ae0, 0xc000023880, 0xc00005aa40, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001a6780, 0xc000192100, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001a6780, 0xc000142010, 0x8, 0x8, 0xc0001a6780, 0xc000142010)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a6780, 0xc00006c778, 0xc00059df78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b
E0829 19:33:58.065566       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1d49520, 0x2fd6c40)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:74 +0x95
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:51 +0xcb
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc0001419c0, 0x22bc568, 0xc00055e000, 0x22ea6a8, 0xc0003946e0, 0x2283ae0, 0xc000023880, 0xc00005aa40, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001a6780, 0xc000192100, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001a6780, 0xc000142010, 0x8, 0x8, 0xc0001a6780, 0xc000142010)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a6780, 0xc00006c778, 0xc00059df78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1b375ee]
goroutine 1 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x2fd5c10, 0x1, 0x1)
    /go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/runtime/runtime.go:55 +0x109
panic(0x1d49520, 0x2fd6c40)
    /usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/argoproj/argo-workflows/v3/workflow/controller.NewWorkflowController(0x22bb8f0, 0xc0001419c0, 0x22bc568, 0xc00055e000, 0x22ea6a8, 0xc0003946e0, 0x2283ae0, 0xc000023880, 0xc00005aa40, 0x4, ...)
    /go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:145 +0x1ce
main.NewRootCommand.func1(0xc0001a6780, 0xc000192100, 0x0, 0x8, 0x0, 0x0)
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:104 +0x63b
github.com/spf13/cobra.(*Command).execute(0xc0001a6780, 0xc000142010, 0x8, 0x8, 0xc0001a6780, 0xc000142010)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0xc0001a6780, 0xc00006c778, 0xc00059df78, 0x406365)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
    /go/pkg/mod/github.com/spf13/cobra@v1.2.1/command.go:902
main.main()
    /go/src/github.com/argoproj/argo-workflows/cmd/workflow-controller/main.go:151 +0x2b

Which leads to https://github.com/argoproj/argo-workflows/blob/e7d0b7c8507fe635ca845a1f550eb866fb7d27b4/cmd/workflow-controller/main.go#L151

alexec commented 2 years ago

Ok. There was a bug. Can you try v0.0.0-dev-mc-1?

shuker85 commented 2 years ago

Thanks @alexec ,

level=fatal msg="Failed to register watch for controller config map: if you have an item in your config map named 'config', you must only have one item" 

More logs in https://gist.github.com/shuker85/eba5fb4452d9063adb42c39eb449f70b

alexec commented 2 years ago

Looks like you're mixing old- and new-style configuration. Try manually editing the config map to fix it.
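
For reference, the two layouts are, roughly: everything nested under a single config item, or each setting as its own top-level data key; mixing them (a cluster key next to a config key, as in the ConfigMap above) triggers the error. A hedged sketch of the split-keys style:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-workflow-controller-configmap
  namespace: argo
data:
  cluster: main                        # each setting is its own key...
  containerRuntimeExecutor: emissary   # ...nothing nested under a "config" item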

shuker85 commented 2 years ago

You're right: my workflow-controller-configmap had been populated by the community Helm chart, where everything lands under data.config. I've tried to adapt the latest changes from https://argoproj.github.io/argo-workflows/workflow-controller-configmap.yaml. Also, since I created the local namespace myself, I think it lacked the proper permissions from step (4) of the readme:

kubectl -n remote apply -f https://raw.githubusercontent.com/argoproj/argo-workflows/master/manifests/quick-start/base/workflow-role.yaml
kubectl -n remote create sa workflow
kubectl -n remote create rolebinding workflow --role=workflow-role --serviceaccount=remote:workflow

I've done s/remote/local in this case.

The latest error I got from trying to run the WF:

controller time="2021-08-30T14:41:09.901Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:41:09.914Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:41:10.283Z" level=info msg="List workflowtasksets 403"
controller E0830 14:41:10.283688       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/reflector.go:167: Failed to watch *v1alpha1.WorkflowTaskSet: failed to list *v1alpha1.WorkflowTaskSet: workflowtasksets.argoproj.io is forbidden: User "system:serviceaccount:argo:argo-workflow-controller" cannot list resource "workflowtasksets" in API group "argoproj.io" at the cluster scope
controller time="2021-08-30T14:41:14.923Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:41:14.946Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:41:16.564Z" level=info msg="List workflows 200"
controller time="2021-08-30T14:41:16.564Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
controller time="2021-08-30T14:41:19.955Z" level=info msg="Get leases 200"
...
controller time="2021-08-30T14:41:35.069Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:41:35.813Z" level=info msg="Processing workflow" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Get configmaps 404"
controller time="2021-08-30T14:41:35.823Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
controller time="2021-08-30T14:41:35.823Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
controller time="2021-08-30T14:41:35.823Z" level=info msg="Updated phase  -> Running" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Pod node multi-cluster-jf2qq initialized Pending" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=error msg="Recovered from panic" namespace=local r="runtime error: invalid memory address or nil pointer dereference" stack="goroutine 242 [running[]:\nruntime/debug.Stack(0xc037b16329, 0x1d49520, 0x2fd6c40)\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x9f\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate.func2(0xc0000ec0c0, 0x22bb9e8, 0xc000128000)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:192 +0xd1\npanic(0x1d49520, 0x2fd6c40)\n\t/usr/local/go/src/runtime/panic.go:971 +0x499\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).createWorkflowPod(0xc0000ec0c0, 0x22bb9e8, 0xc000128000, 0xc000c684e0, 0x13, 0xc0002df008, 0x1, 0x1, 0xc000d50280, 0xc0002defc8, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/workflowpod.go:152 +0x1f9\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeContainer(0xc0000ec0c0, 0x22bb9e8, 0xc000128000, 0xc000c684e0, 0x13, 0xc000869120, 0x19, 0xc000d50280, 0x22a6670, 0xc0000ec480, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:2373 +0x310\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).executeTemplate(0xc0000ec0c0, 0x22bb9e8, 0xc000128000, 0xc000c684e0, 0x13, 0x22a6670, 0xc0000ec480, 0xc000c206c0, 0x0, 0x0, ...)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:1813 +0x268e\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*wfOperationCtx).operate(0xc0000ec0c0, 0x22bb9e8, 0xc000128000)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/operator.go:342 +0xf1e\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).processNextItem(0xc000100800, 0x22bb9e8, 0xc000128000, 0x0)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:841 +0x830\ngithub.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).runWorker(0xc000100800)\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:763 +0x9b\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0001f36a0)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155 +0x5f\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0001f36a0, 0x22754c0, 0xc000baf9e0, 0x1, 0xc000115980)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156 +0x9b\nk8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0001f36a0, 0x3b9aca00, 0x0, 0x1, 0xc000115980)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133 +0x98\nk8s.io/apimachinery/pkg/util/wait.Until(0xc0001f36a0, 0x3b9aca00, 0xc000115980)\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:90 +0x4d\ncreated by github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).startLeading\n\t/go/src/github.com/argoproj/argo-workflows/workflow/controller/controller.go:389 +0x527\n" workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Updated phase Running -> Error" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Updated message  -> runtime error: invalid memory address or nil pointer dereference" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Marking workflow completed" namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.823Z" level=info msg="Checking daemoned children of " namespace=local workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.842Z" level=info msg="Update workflows 200"
controller time="2021-08-30T14:41:35.843Z" level=info msg="Workflow update successful" namespace=local phase=Error resourceVersion=8853138 workflow=multi-cluster-jf2qq
controller time="2021-08-30T14:41:35.846Z" level=info msg="Create events 201"
controller time="2021-08-30T14:41:35.861Z" level=info msg="Create events 201"
controller time="2021-08-30T14:41:40.075Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:41:40.084Z" level=info msg="Update leases 200"
...
controller time="2021-08-30T14:41:55.157Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:41:59.790Z" level=info msg="List workflowtasksets 403"
controller E0830 14:41:59.790436       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/reflector.go:167: Failed to watch *v1alpha1.WorkflowTaskSet: failed to list *v1alpha1.WorkflowTaskSet: workflowtasksets.argoproj.io is forbidden: User "system:serviceaccount:argo:argo-workflow-controller" cannot list resource "workflowtasksets" in API group "argoproj.io" at the cluster scope
controller time="2021-08-30T14:42:00.163Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:42:00.182Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:42:05.198Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:42:05.212Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:42:09.994Z" level=info msg="Watch workflowtemplates 200"
controller time="2021-08-30T14:42:10.217Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:42:10.234Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:44:52.936Z" level=info msg="Alloc=8142 TotalAlloc=63486 Sys=73809 NumGC=18 Goroutines=172"

controller time="2021-08-30T14:45:16.559Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=
controller time="2021-08-30T14:45:21.592Z" level=info msg="Get leases 200"
controller time="2021-08-30T14:45:21.614Z" level=info msg="Update leases 200"
controller time="2021-08-30T14:45:26.164Z" level=info msg="List workflowtasksets 403"
controller E0830 14:45:26.164677       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.20.4/tools/cache/reflector.go:167: Failed to watch *v1alpha1.WorkflowTaskSet: failed to list *v1alpha1.WorkflowTaskSet: workflowtasksets.argoproj.io is forbidden: User "system:serviceaccount:argo:argo-workflow-controller" cannot list resource "workflowtasksets" in API group "argoproj.io" at the cluster scope
alexec commented 2 years ago

This panic occurred because there is no context for your cluster. Check that the kubeconfig contains a context with the same name as the cluster in your template. I'll add some extra diagnostics.
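
A minimal sketch of a kubeconfig that satisfies that check, with all names illustrative (the point is that the context name matches the cluster name the template references):

apiVersion: v1
kind: Config
current-context: cluster-1
clusters:
  - name: cluster-1
    cluster:
      server: https://cluster-1.example.com
contexts:
  - name: cluster-1   # same name as the cluster referenced by the workflow
    context:
      cluster: cluster-1
      user: cluster-1
users:
  - name: cluster-1
    user: {}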

boshnak commented 2 years ago

Is there any update on this feature?

caelan-io commented 2 years ago

@boshnak - @JPZ13 and I have been working on a feature based on @alexec's PoC where users can run a workflow that has 2+ steps running in different clusters or namespaces. We're wrapping up the design phase for that now. This is different from a multi-cluster control plane, which is our day job at Pipekit.

Would you mind sharing your use case(s) and any requirements you have for multi-cluster workflows? If you're open to speaking live, I'm at c@pipekit.io and we can find a time to run through your requirements.

boshnak commented 2 years ago

@caelan-io Thanks for the prompt response. Our main use case is that we have 4 Kubernetes clusters, and we would like a single centralized Argo Workflows instance from which we can trigger workflows on all the clusters. So it's a combination of having a central Argo server and the ability to trigger workflows on other clusters. I guess having specific steps triggered on different clusters covers the majority of the use case.

dabenson4 commented 2 years ago

Hi @alexec, thanks for working on this. I've been trying to make it work, but I am not sure why the workflow pod is not spinning up on the remote cluster; the logs just indicate it is choosing the main cluster (where Argo is installed).

Logs from the workflow-controller seem to indicate the cluster is added and that it is reading the kubeconfig secret:

time="2022-03-07T23:09:47.244Z" level=info msg="starting pod informer" cluster=main labelSelector="workflows.argoproj.io/completed=false,!multi-cluster.argoproj.io/owner-cluster,!workflows.argoproj.io/controller-instanceid" managedNamespace=argo time="2022-03-07T23:09:47.244Z" level=info msg="starting pod informer" cluster=dgx-us labelSelector="workflows.argoproj.io/completed=false,multi-cluster.argoproj.io/owner-cluster=main,!workflows.argoproj.io/controller-instanceid" managedNamespace=argo

Notice how the pod indicates it is going to cluster: main instead of cluster: dgx-us, as I am specifying in the workflow:

time="2022-03-07T23:12:27.408Z" level=info msg="Processing workflow" namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.411Z" level=info msg="Get configmaps 200" time="2022-03-07T23:12:27.411Z" level=info msg="resolved artifact repository" artifactRepositoryRef="argo/#" time="2022-03-07T23:12:27.411Z" level=info msg="Updated phase -> Running" namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.411Z" level=info msg="Pod node multi-cluster-finaltest initialized Pending" namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.411Z" level=info msg="creating workflow pod" cluster=main exists=false namespace=argo nodeID=multi-cluster-finaltest ownershipCluster=main podName=multi-cluster-finaltest time="2022-03-07T23:12:27.439Z" level=info msg="Create events 201" time="2022-03-07T23:12:27.452Z" level=info msg="Create pods 201" time="2022-03-07T23:12:27.457Z" level=info msg="Created pod: multi-cluster-finaltest (multi-cluster-finaltest)" namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.457Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.457Z" level=info msg=reconcileAgentPod namespace=argo workflow=multi-cluster-finaltest time="2022-03-07T23:12:27.473Z" level=info msg="Update workflows 200" time="2022-03-07T23:12:27.474Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=253591298 workflow=multi-cluster-finaltest

NOTE: I am trying to run the workflows both locally in the argo namespace and on the remote cluster.

I would much appreciate your guidance. I am using tag v0.0.0-dev-mc-4.

aychen99 commented 2 years ago

Hi @alexec, this feature looks really promising, thanks for working on it! Just curious though: roughly how long do you think it'll take to release this officially? My team is looking to make use of multi-cluster workflows in the future, so we'd appreciate any estimates you can provide.

GitHubxsy commented 2 years ago

I really like this feature. What stage is it at now?

alexec commented 2 years ago

This is now available to try out:

Install from here:

https://github.com/argoproj/argo-workflows/releases/tag/v0.0.0-dev-mc-6

Read this to learn how to configure:

https://github.com/argoproj/argo-workflows/blob/dev-mc/docs/multi-cluster.md

alexec commented 2 years ago

Can I please ask everyone to complete this survey:

https://forms.gle/mxh9TLMG5mG8tZmc7

XiShanYongYe-Chang commented 2 years ago

Hi @alexec, I would like to confirm whether the newest version (v0.0.0-dev-mc-6) supports different steps of the same workflow running on different clusters.

I tested on my local site: when I installed profiles for two clusters, member1 and member2, it produced an error message like profile not found for policy argo,member2,default,1.

alexec commented 2 years ago

Hi @XiShanYongYe-Chang, I need to update this, as I've changed the design. Hopefully today.

alexec commented 2 years ago

v0.0.0-dev-mc-7 is now ready for testing.

XiShanYongYe-Chang commented 2 years ago

@alexec thanks for your reply. I tested with v0.0.0-dev-mc-7, and it's pretty good to me. With this release, I can run workflow-a- on the member1 cluster and workflow-b- on the member2 cluster separately.

Will this feature be supported within one workflow, such as workflow-test-, where step-a runs in the member1 cluster and step-b runs in the member2 cluster?

alexec commented 2 years ago

Yes. That's the primary intent.

JPZ13 commented 2 years ago

@alexec Is there a way to list the cluster profiles for debugging? I ran the argo cluster get-profile cluster-1 ... command and it was successful, but on running a workflow I get error in entry template execution: profile not found for "cluster-1".

alexec commented 2 years ago

kubectl get secret -l workflows.argoproj.io/cluster will list the profile secrets.

JPZ13 commented 2 years ago

I think I'm running into a namespace issue. kubectl get secret -l workflows.argoproj.io/cluster returns No resources found in default namespace., while kubectl get secret -l workflows.argoproj.io/cluster -n argo returns argo.profile.cluster-1 Opaque 1 86m. @alexec, what namespace do you have the server and controller installed in, and which namespace is the profile in for you?

alexec commented 2 years ago

The profile goes in the Argo system namespace.

alexec commented 2 years ago

Updated version for testing:

https://github.com/argoproj/argo-workflows/releases/tag/v0.0.0-dev-mc-8