argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Add support for GCP authentication for GKE #5958

Closed damadei-google closed 2 years ago

damadei-google commented 3 years ago

Summary

Add support for GCP authentication for GKE instead of keeping a token in a secret, much like the existing AWS EKS support.

Motivation

Some GKE multi-cluster management scenarios can rely on Anthos Connect Gateway, as described here: https://cloud.google.com/anthos/multicluster-management/gateway/using. In such scenarios, the only authentication option is GCP authentication, sending an OAuth token instead of a Kubernetes token.

This would open up many possibilities, such as having Argo CD in GCP orchestrate the deployment of many environments in multiple places (on-prem, GCP, other clouds, etc.) via Anthos Connect Gateway.

Proposal

Leverage GCP authentication, similar to what aws eks get-token does today. I believe Kubernetes client-go already supports the auth plugin for GCP.
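
For context, this is the exec-based flow that aws eks get-token plugs into in a kubeconfig today; the GCP support proposed here would be the analogue of this sketch (cluster and user names are placeholders):

users:
- name: eks-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: aws
      args: ["eks", "get-token", "--cluster-name", "my-cluster"]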

laloyalo commented 3 years ago

Sounds like a superset of the improvement described in #3027. The Connect Gateway sounds interesting, thanks for sharing!

damadei-google commented 3 years ago

@laloyalo I'm trying to get it working on my end and will then submit a PR.

laloyalo commented 3 years ago

Thanks @damadei-google! I will gladly review it and offer some help; if needed, let me know!

damadei-google commented 3 years ago

@laloyalo

@laloyalo I've implemented what I believe is needed; however, when I try to run make test, I get the timeout below. I've run it multiple times with the same result, and if I run the git command by hand it works. Any idea?

--- FAIL: TestVerifyCommitSignature (92.16s)
    git_test.go:287:
        Error Trace: git_test.go:287
        Error:       Received unexpected error: git fetch origin --tags --force failed timeout after 1m30s
        Test:        TestVerifyCommitSignature
    git_test.go:293:
        Error Trace: git_test.go:293
        Error:       Received unexpected error: git checkout --force ae2d0ff0a6ac34dc3e8493b2ad45a9badc61a26c failed exit status 128: fatal: reference is not a tree: ae2d0ff0a6ac34dc3e8493b2ad45a9badc61a26c
        Test:        TestVerifyCommitSignature
    git_test.go:301:
        Error Trace: git_test.go:301
        Error:       "error: 28027897aad1262662096745f2ce2d4c74d02b7f: unable to read file." does not contain "gpg: Signature made"
        Test:        TestVerifyCommitSignature
    git_test.go:308:
        Error Trace: git_test.go:308
        Error:       Should be empty, but was error: 85d660f0b967960becce3d49bd51c678ba2a5d24: unable to read file.
        Test:        TestVerifyCommitSignature
FAIL

laloyalo commented 3 years ago

Hello @damadei-google, sorry, I had some trouble setting up my dev env with minikube. I was able to run the make test command and it ran successfully. I am not sure if you ran it in a particular way. I also noticed yesterday that I had some issues with the git command itself. Have you tried it today? Do you have a branch I could try out? Thanks!

zhang-xuebin commented 3 years ago

Hello @damadei-google, wondering if there is any update on this thread? Were you able to make it work with gcloud?

raman-nbg commented 3 years ago

@damadei-google is there any update? I'm quite new to Argo, but I would be interested in contributing somehow.

Matroxt commented 3 years ago

There has been a discussion here about this, and even a PR here, but the PR got closed without any obvious reason 😢. I was hoping it would be added to the v2.1 milestone, but that's obviously not happening. Not sure what needs to be done to push this for v2.2.

raman-nbg commented 3 years ago

There is another discussion related to this: https://github.com/argoproj/argo-cd/discussions/6553

raman-nbg commented 3 years ago

I tried to extend the argocd image and add the gcloud SDK. But I got stuck somewhere...

My Dockerfile

FROM argoproj/argocd:latest

# See https://argoproj.github.io/argo-cd/operator-manual/custom_tools/#byoi-build-your-own-image
#  for instructions on how to extend the Argo CD image to provide custom tooling.

# Switch to root for the ability to perform install
USER root

# Install gcloud cli
# See https://cloud.google.com/sdk/docs/downloads-interactive for details
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN curl https://sdk.cloud.google.com > install.sh
RUN bash install.sh --disable-prompts --install-dir=/usr/local/

# Switch back to non-root user
USER argocd

COPY get-gcloud-kube-config /usr/local/google-cloud-sdk/bin/get-gcloud-kube-config

# Add gcloud cli to path
ENV PATH $PATH:/usr/local/google-cloud-sdk/bin

The get-gcloud-kube-config script is a basic shell script:

#!/bin/bash

# Parse -p <project>, -c <cluster>, -r <region>
while getopts p:c:r: flag
do
    case "${flag}" in
        p) project=${OPTARG};;
        c) cluster=${OPTARG};;
        r) region=${OPTARG};;
    esac
done

gcloud config set project "$project" > /dev/null 2>&1
gcloud container clusters get-credentials "$cluster" --region "$region" > /dev/null 2>&1

# Print the kubeconfig that gcloud just wrote
if [ -z "${KUBECONFIG+x}" ]; then
    cat "$HOME/.kube/config"
else
    cat "$KUBECONFIG"
fi
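
Invoked standalone, the script above would be called like this (placeholder values):

get-gcloud-kube-config -p <project-name> -c <cluster-name> -r europe-west3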

After that I tried to add a cluster config to argo:

apiVersion: v1
kind: Secret
metadata:
  name: my-gke-cluster
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: my-gke-cluster
  server: https://<ip-address>
  config: |
    {
      "execProviderConfig": {
        "command": "get-gcloud-kube-config",
        "args": [
          "-p",
          "<project-name>",
          "-c",
          "<cluster-name>",
          "-r",
          "europe-west3"
        ],
        "apiVersion": "client.authentication.k8s.io/v1alpha1",
        "installHint": "Could not find command 'get-gcloud-kube-config'"
      }
    }

But I always get the following error:

Failed to cache app resources: Get "https://<ip-address>/version?timeout=32s": 
  getting credentials: exec plugin is configured to use API version client.authentication.k8s.io/v1alpha1, 
  plugin returned version client.authentication.k8s.io/__internal

Maybe this is somehow related to https://github.com/argoproj/argo-cd/issues/6749?
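
A likely contributor to this error, beyond the apiVersion question: an exec credential plugin is expected to print an ExecCredential object to stdout, but the script above cats an entire kubeconfig, which client-go cannot decode as a versioned ExecCredential (hence the __internal version in the message). A minimal well-formed output, assuming the configured v1alpha1 apiVersion and a placeholder token, would be:

{
  "apiVersion": "client.authentication.k8s.io/v1alpha1",
  "kind": "ExecCredential",
  "status": {
    "token": "<gcp-oauth-access-token>"
  }
}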

raman-nbg commented 3 years ago

@jannfis can you give us some guidance here, please? Or at least do you know somebody who knows somebody...? :)

jessesuen commented 3 years ago

Does GKE have to work using an exec provider like AWS does? I thought not, because of the gcp auth provider that is compiled into client-go:

import _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"

But I notice that the codebase does not even import it, so maybe that is part of the problem. The following post seems to outline other things that may need to be done to get this to work:

https://www.fullstory.com/blog/connect-to-google-kubernetes-with-gcp-credentials-and-pure-golang/
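
For illustration, here is a minimal sketch of what using that compiled-in provider looks like from plain client-go; the kubeconfig path is a placeholder, and note the plugin was deprecated and later removed in client-go v0.26+:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	// Side-effect import that registers the in-tree "gcp" auth provider.
	// Without it, client-go fails with "no Auth Provider found for name gcp".
	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig whose user entry declares auth-provider name "gcp".
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes:", len(nodes.Items))
}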

piotrjanik commented 3 years ago

I have been using https://github.com/sl1pm4t/gcp-exec-creds with:

execProviderConfig:
  apiVersion: client.authentication.k8s.io/v1beta1
  command: gcp-exec-creds

Would love to see ArgoCD with native support for GCP's Service Accounts.

What is the preferred way to implement it? An external exec command somehow copied into the container, or built-in (Go)?

RixTmobilender commented 3 years ago

EDIT: After a very painful debugging session, I realized I had never annotated the k8s service account with the required Workload Identity annotation iam.gke.io/gcp-service-account. It immediately sprang to life once I added it. This was a sad day for my productivity.
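
For anyone hitting the same wall, the annotation in question looks roughly like this (service account and project names are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd-application-controller
  namespace: argocd
  annotations:
    # Must match the IAM binding on the GCP side for Workload Identity to work.
    iam.gke.io/gcp-service-account: argocd@<project-id>.iam.gserviceaccount.com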

@piotrjanik Do you have a working example with gcp-exec-creds you could share? I can't for the life of me get the execProvider working 😅. I attempted to follow the steps in https://github.com/argoproj/argo-cd/discussions/6563 but I keep hitting the "the server has asked for the client to provide credentials" message.

My (jsonnet) configuration is as follows:

config: {
  execProviderConfig: {
    apiVersion: "client.authentication.k8s.io/v1beta1", 
    command: "/kubecluster/goapps/bin/gcp-exec-creds", # pre-baked in custom image
    args: ["|", "sed", "-e", "s/ExecCredential/kind/", "-" ]
  },
  tlsClientConfig: {
    insecure: true,
  },
},

and my pre-baked Dockerfile:

FROM argoproj/argocd:v2.1.3

# Switch to root for the ability to perform install
USER root

RUN apt-get update && \
    apt-get install -y \
        curl \
        wget \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN mkdir -p /kubecluster/goapps && \
      cd /kubecluster && wget https://dl.google.com/go/go1.17.3.linux-amd64.tar.gz && \
      tar -C /kubecluster -xzf go1.17.3.linux-amd64.tar.gz && \
      export GOPATH=/kubecluster/goapps && \
      /kubecluster/go/bin/go install  github.com/sl1pm4t/gcp-exec-creds@latest && \
      chmod -R 777 /kubecluster && chown 999:999 /kubecluster

# Switch back to non-root user
USER 999

I have confirmed that the specified command + args in the pre-baked image produces:

{
  "apiVersion": "client.authentication.k8s.io/v1beta1",
  "kind": "ExecCredential",
  "status": {
    "token": "<redacted_token>"
  }
}

Yet, my cluster (in the argo UI) remains in Failed status. Has anyone got the execProvider working on GKE?

pjamenaja commented 2 years ago

Guys, I spent my entire day making this work. I hope this will help save time for others.

  1. I created my own custom docker image below.
FROM argoproj/argocd:v2.1.7

# See https://argoproj.github.io/argo-cd/operator-manual/custom_tools/#byoi-build-your-own-image
#  for instructions on how to extend the Argo CD image to provide custom tooling.

# Switch to root for the ability to perform install
USER root

ADD certs/*.crt /usr/local/share/ca-certificates 
RUN ls -lrt /usr/local/share/ca-certificates
RUN update-ca-certificates

# Install gcloud cli
# See https://cloud.google.com/sdk/docs/downloads-interactive for details
RUN apt-get update && \
    apt-get install -y curl && \
    apt-get install -y wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN curl https://sdk.cloud.google.com > install.sh
RUN bash install.sh --disable-prompts --install-dir=/usr/local/

RUN mkdir -p /kubecluster/goapps && \
    cd /kubecluster && wget https://dl.google.com/go/go1.17.3.linux-amd64.tar.gz && \
    tar -C /kubecluster -xzf go1.17.3.linux-amd64.tar.gz && \
    export GOPATH=/kubecluster/goapps && \
    /kubecluster/go/bin/go install  github.com/sl1pm4t/gcp-exec-creds@latest && \
    chmod -R 777 /kubecluster && chown 999:999 /kubecluster

# Switch back to non-root user
USER argocd

COPY get-gcloud-kube-config /usr/local/google-cloud-sdk/bin/get-gcloud-kube-config

# Add gcloud cli to path
ENV PATH $PATH:/usr/local/google-cloud-sdk/bin:/kubecluster/goapps/bin

RUN ls -lrt /usr/local/google-cloud-sdk/bin/get-gcloud-kube-config
RUN gcloud -v
  2. Create a custom bash script "get-gcloud-kube-config" to authenticate with gcloud
#!/bin/bash

while getopts p:c:r:z: flag
do
    case "${flag}" in
        p) project=${OPTARG};;
        c) cluster=${OPTARG};;
        r) region=${OPTARG};;
        z) zone=${OPTARG};;
    esac
done

gcloud config set project "$project" > /dev/null 2>&1
gcloud auth activate-service-account --key-file="${GOOGLE_APPLICATION_CREDENTIALS}" > /dev/null 2>&1

if [ "${region}" != '' ]; then
    gcloud container clusters get-credentials "$cluster" --region "$region" > /dev/null 2>&1
else
    # Zone by default
    gcloud container clusters get-credentials "$cluster" --zone "$zone" > /dev/null 2>&1
fi

# Return ExecCredential to stdout - https://github.com/sl1pm4t/gcp-exec-creds
gcp-exec-creds
  3. In the ArgoCD secret I set it as below.
    apiVersion: v1
    kind: Secret
    metadata:
      name: gke-cluster1
      labels:
        argocd.argoproj.io/secret-type: cluster
    type: Opaque
    stringData:
      name: gke-cluster1
      server: https://<change-this-to-the-cluster-ip>/
      config: |
        {
          "execProviderConfig": {
            "command": "get-gcloud-kube-config",
            "args": [
              "-p",
              "<change-this-to-your-gcp-project>",
              "-c",
              "<cluster-name-here>",
              "-z",
              "asia-southeast1-a"
            ],
            "apiVersion": "client.authentication.k8s.io/v1beta1",
            "installHint": "Could not find command 'get-gcloud-kube-config'"
          }
        }

Note: you will need to mount the GCP JSON key file into the pod's volume and populate the GOOGLE_APPLICATION_CREDENTIALS environment variable with the path to that JSON key file.
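
A hypothetical fragment of what that mount looks like in the application controller pod spec (secret name and paths are placeholders, not from the original post):

      containers:
      - name: argocd-application-controller
        env:
        # Standard variable picked up by Google auth libraries and gcloud.
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        volumeMounts:
        - name: gcp-key
          mountPath: /var/secrets/google
          readOnly: true
      volumes:
      - name: gcp-key
        secret:
          secretName: gcp-sa-key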

Thanks

romachalm commented 2 years ago

I have handled the GKE authentication with a slightly different approach. Instead of modifying the argocd image, I have transformed the gcp-exec-creds binary into a small API server that I deploy as a sidecar to the application controller container.

The sidecar is deployed by patching the application controller kustomization. Because I run argocd in GKE, I have bound, via Workload Identity, the service account associated with the application controller (argocd-application-controller) to an IAM service account in the GKE project.
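
A hypothetical sketch of such a strategic-merge patch (the sidecar image is a placeholder; the actual kustomization lives in the repo linked below):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
      # Small API server that returns an ExecCredential on /creds
      - name: gcp-exec-creds
        image: <your-registry>/gcp-exec-creds-sidecar:latest
        ports:
        - containerPort: 8080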

In the projects hosting the clusters that I need to deploy to with argocd, I simply grant the roles/container.admin IAM role to that service account.

Once this is configured, the application controller has full access to the target clusters. And finally, because the argocd image does not have curl, I use a small Python one-liner to request the creds from the sidecar, such as:

apiVersion: v1
kind: Secret
metadata:
  name: gke-cluster1
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name:  gke-cluster1
  server: https://<gke endpoint>/
  config: |
    {
      "execProviderConfig": {
        "apiVersion": "client.authentication.k8s.io/v1beta1",
        "command": "python3",
        "args": [
          "-c",
          "import urllib.request; print(urllib.request.urlopen('http://localhost:8080/creds').read().decode('utf8'))"
        ]
      },
      "tlsClientConfig": {
        "caData": "XXXX",
        "insecure": false
      }
    }

I have just pushed the code and the kustomization examples of what I use in https://github.com/romachalm/gcp-exec-creds-sidecar

lusid commented 2 years ago

@romachalm Thank you so much for providing this solution. I've gotten it mostly working, but I've run into a small snag, I don't know exactly what is causing it, and I was hoping you might have some idea. Here is where I am:

When I open Argo's GUI to the cluster screen and attempt to just edit/save the cluster information, it fails with this message in the GUI:

Unable to save changes: Get "https://[cluster IP redacted]/version?timeout=32s": getting credentials: exec: executable python3 failed with exit code 1

A similar message is shown if I attempt to start deploying things to it. I'm not seeing any logs in the gcp-exec-creds sidecar container showing that the request ever hits it or that a token is returned, and it seems like Argo tries to verify that the endpoint exists first (which I've proven works from inside the container's shell). I feel like I've tried everything at this point and I'm unsure of what else to check.

Interestingly enough, if I leave everything exactly as it is and just use the bearerToken from my local kubeconfig, it works perfectly. It also works if I manually set the bearerToken to the one that I retrieve from the container using your python command. So it seems like something specifically going wrong with the execProviderConfig setup that is causing it to never try to retrieve the token?

What is really strange to me is that changing the python line to simply print a hardcoded copy of what your service returns works with no problem... I don't understand how the python command can fail only when the Argo controller is running it and attempting to open the localhost:8080/creds URL:

          args: [
            "-c",
            "print('{\"apiVersion\":\"client.authentication.k8s.io/v1beta1\",\"kind\":\"ExecCredential\",\"status\":{\"token\":\"ya29.c.KtwCHw... etc\"}}')"
          ]

olvesh commented 2 years ago

I ended up adding github.com/sl1pm4t/gcp-exec-creds to my argocd image.

apiVersion: v1
kind: Secret
metadata:
  name: my-gcp-cluster
  annotations:
    managed-by: argocd.argoproj.io
  labels:
    argocd.argoproj.io/secret-type: cluster
stringData:
  config: |
    {
      "tlsClientConfig": {
        "insecure": false,
        "caData": "[snip]"
      },
      "execProviderConfig": {
        "command": "gcp-exec-creds",
        "apiVersion": "client.authentication.k8s.io/v1beta1"
      }
    }
  name: [redacted]
  server: https://[redacted]

This seems to work well.

The argocd k8s service account is connected via Workload Identity to a GCloud service account that has access to the other cluster via a cluster role binding like this (note the extra User subject):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-manager-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: argocd-manager-role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: argocd@[redacted].iam.gserviceaccount.com
- kind: ServiceAccount
  name: argocd-manager
  namespace: kube-system

lusid commented 2 years ago

I finally figured out what I was doing wrong. I'm using the helm charts provided here: https://github.com/argoproj/argo-helm/tree/master/charts/argo-cd

It splits the deployment into four parts: application-controller, repo-server, redis, and server

In my case, I needed to add the sidecar container to both application-controller (needed at deployment time) and server (for the GUI to work?), which are deployed by default with two different service accounts (so both service accounts needed to be tied to the Workload Identity), and I had to set the sidecar to use a port that didn't conflict with either of them.

For now, this will get me by. I was trying to avoid having to modify the argocd image which is why I liked the sidecar approach, but I might end up switching eventually if I get around to automating the ArgoCD custom image build. Thanks!

AtzeDeVries commented 2 years ago

I've also succeeded using GKE and Workload Identity. I've been using the Helm chart, but this should also work for kustomize. For the Helm config I have:

controller:
    serviceAccount:
      annotations:
        iam.gke.io/gcp-service-account: <gcp-sa-name>@<projectid>.iam.gserviceaccount.com
    volumes:
      - name: custom-tools
        emptyDir: {}
    initContainers:
      - name: gcp-exec-credentials-installer
        image: golang:1.17-buster
        env:
          - name: GOPATH
            value: /custom-tools/go
        command: [sh, -c]
        args:
          - go install github.com/sl1pm4t/gcp-exec-creds@44ac497
        volumeMounts:
          - mountPath: /custom-tools
            name: custom-tools
    volumeMounts:
      - mountPath: /custom-tools
        name: custom-tools

And then the secret is:

apiVersion: v1
stringData:
  name: <cluster name>
  server: <cluster ip>
  config: |
    {
      "execProviderConfig": {
        "command": "/custom-tools/go/bin/gcp-exec-creds",
        "apiVersion": "client.authentication.k8s.io/v1beta1",
        "installHint": "Could not find command 'gcp-exec-creds'"
      },
      "tlsClientConfig": {
        "insecure": false,
        "caData": "redacted"
      }
    }
kind: Secret
metadata:
  labels:
    argocd.argoproj.io/secret-type: cluster
  name: <secret-name>
  namespace: argocd
type: Opaque

The GCP SA used by Argo needs to have access to the remote cluster via GCP IAM roles.
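
One way to grant that access, assuming roles/container.developer suffices for your setup (project and SA names are placeholders):

gcloud projects add-iam-policy-binding <target-project-id> \
  --member="serviceAccount:<gcp-sa-name>@<projectid>.iam.gserviceaccount.com" \
  --role="roles/container.developer"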

romachalm commented 2 years ago

@lusid that's strange. I only have the sidecar running on application-controller and not on server:

argocd-application-controller-0                     2/2     Running    0          11h
argocd-server-585ddd6bb7-h4jgf                      1/1     Running    0          11h

And I have no error shown in the GUI; the status is successful. I have tested on version v2.2.0+6da92a8. What's yours?

romachalm commented 2 years ago

@lusid I ran into the issue you mentioned. TIL argocd-server also uses the cluster secret to fetch the logs from the pods. So indeed the server also needs to run with the sidecar.

To do so, I had to change my sidecars to listen on port 7070 (simply adding args: ["-port", "7070"] to the sidecars), because the default conflicts with port 8080 used by argocd-server. My exec cmd became:

"import urllib.request; print(urllib.request.urlopen('http://localhost:7070/creds').read().decode('utf8'))"

I also needed to bind the argocd-server SA to Workload Identity.

I have access to the pod logs in the UI again.

williamsmt commented 2 years ago

#8032 seems promising for a long-term approach

elebioda commented 2 years ago

I have applied the custom image and the gke binary and have it working... somewhat... with multi-cluster, using a Connect Gateway to authenticate Argo to the cluster's API; however, I am having two problems:

1) If the Connect Gateway agent crashes, all resources in the associated remote cluster are gone, and the application controller does not attempt to re-watch them. I have to manually run "invalidate cache" for the resources to pop up again. I understand why they disappear when the agent crashes, but the agent does come back up, and I would expect the application controller to retry the watch on all the resources.

2) Another interesting thing I am noticing: when I deploy an app-of-apps pattern to the remote cluster, it deploys the Application to the cluster, but the application controller does nothing with it. It sees it there, but it will not deploy the resources associated with the Application to the cluster.

zhang-xuebin commented 2 years ago

Based on the nice work in #9190, here is an example with detailed instructions (based on Helm): https://github.com/zhang-xuebin/argo-helm/tree/host-on-gke

You can use Argo CD to manage private GKE clusters. VPC peering, Master Authorized Networks, etc., are NOT needed.

clive2000 commented 2 years ago

@zhang-xuebin Hello, I followed your guide and was able to connect to a private GKE cluster using Connect Gateway.

However, in the last step, when creating the secret for the external cluster, I needed to change the server URL from server: https://connectgateway.googleapis.com/v1beta1/... to server: https://connectgateway.googleapis.com/v1/... in order to make it work.

zhang-xuebin commented 2 years ago

@clive2000 thanks for sharing the feedback. That's interesting; would you mind sharing what the problem is when you use https://connectgateway.googleapis.com/v1beta1/...? v1beta1 should behave very similarly to v1.

laloyalo commented 1 year ago

Hello @zhang-xuebin, I checked the guide and repo, and it's great. I have a question: are the Helm charts provided in the repo any different from the default charts in the official Helm repos, or were they modified in any way to allow the private connection to work? Thanks in advance!

zhang-xuebin commented 1 year ago

Hello @laloyalo, you can see the latest commit: https://github.com/argoproj/argo-helm/commit/458221674e44d080f6fabb3b1cfdadee4f28fe6f

Compared with the official Helm charts, the only difference is in ./charts/argo-cd/values.yaml, which enables Workload Identity by adding an annotation.

kartoch commented 1 year ago

A note of caution: the server URL format has changed.

For instance:

https://connectgateway.googleapis.com/v1/projects/814233161045/locations/global/gkeMemberships/my-company-dev