airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

GKE Workload Identity #13081

Open userbradley opened 2 years ago

userbradley commented 2 years ago

Tell us about the problem you're trying to solve

I am trying to set up Airbyte in a secure manner on a GKE cluster running on Google Cloud.

As it stands, you need to create a service account and keys, then base64-encode these values and store them as a secret in the cluster.

apiVersion: v1
kind: Secret
metadata:
  name: gcs-log-creds
  namespace: default
data:
  gcp.json: ""

Describe the solution you’d like

Ideally I would like to use Workload Identity, where we specify a Kubernetes service account that Airbyte uses on the cluster, which then impersonates a GCP service account for anything leaving the cluster.
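
For reference, on the Kubernetes side standard Workload Identity is mostly just an annotated service account; a rough sketch (every name below is a placeholder, not something Airbyte ships today):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: airbyte-admin   # placeholder service account name
  namespace: default    # placeholder namespace
  annotations:
    # placeholder GCP service account that Airbyte would impersonate
    iam.gke.io/gcp-service-account: airbyte-logs@MY_PROJECT.iam.gserviceaccount.com

The GCP service account then only needs a roles/iam.workloadIdentityUser binding for that Kubernetes service account, and no key ever has to be exported.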

Describe the alternative you’ve considered or used

Simply not using the logging, as creating and exporting service account keys goes against our organizational policies.

Additional context

No

Are you willing to submit a PR?

Yes! I'm not 100% sure where I can help, perhaps with the KB writing!

Discourse post

https://discuss.airbyte.io/t/airbyte-using-fleet-workload-identity-overwrites-google-application-credentials-inside-connector/2277/1

Santhin commented 2 years ago

My solution was to use fleet Workload Identity

These links gave me a glimpse of how to create a little PoC with working Workload Identity:
https://cloud.google.com/anthos/fleet-management/docs/use-workload-identity
https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity
https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/blob/31eb25ddfe20a8d38fd67e44bff9d5f16b6a503b/cloud-pubsub/deployment/pubsub-with-secret.yaml
https://cloud.google.com/kubernetes-engine/docs/tutorials/authenticating-to-cloud-platform#config-connector

userbradley commented 2 years ago

Thanks @Santhin - the links you've provided (well, at least this one and this one) are still using the key.json file.

Can you share any modifications you needed to make to get Airbyte to work with Workload Identity instead of an SA key?

I am pretty familiar with K8s Workload Identity to GCP (we have a few deployments using it), but I'm unsure whether Airbyte will work with it, as it seems to be expecting the key file.

Thoughts?

Santhin commented 2 years ago

Exactly, this was the wall for me: how to use Workload Identity when Airbyte expects a key.json. The solution was to use Fleet Workload Identity, which gives you the possibility to generate an access token from a Kubernetes service account.

First you need to create the service account:

resource "google_service_account" "sa_airbyte" {
  account_id = "airbyte-admin"
}
resource "google_project_iam_member" "sa_airbyte" {
  project = var.project
  role    = google_project_iam_custom_role.cr_airbyte.name
  member  = "serviceAccount:${google_service_account.sa_airbyte.email}"
}
resource "google_service_account_iam_member" "sa_airbyte" {
  service_account_id = google_service_account.sa_airbyte.id
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project}.svc.id.goog[airbyte/airbyte-admin]"
}

I tested with a different name; the account_id must match the service account name used inside the Helm chart, which is airbyte-admin.

Now we need to create a JSON file with impersonated credentials. I encourage you to follow these docs: https://cloud.google.com/anthos/fleet-management/docs/use-workload-identity#use_fleet_workload_identity. The variable var.airbyte_gcs_log_creds_payload contains this JSON file:

{
  "type": "external_account",
  "audience": "identitynamespace:WORKLOAD_IDENTITY_POOL:IDENTITY_PROVIDER",
  "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/GSA_NAME@GSA_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken",
  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
  "token_url": "https://sts.googleapis.com/v1/token",
  "credential_source": {
    "file": "/secrets/tokens/gcp-ksa/token"
  }
}

(In our example the token is going to be mounted at /secrets/tokens/gcp-ksa/token; see the volume mounts and screenshots below.)

With this JSON file we need to create a Kubernetes secret; in my example it was something like this:

resource "kubernetes_manifest" "airbyte_gcs_log_creds" {
  manifest = {
    "apiVersion" = "v1"
    "data" = {
      "gcp.json" = base64encode(var.airbyte_gcs_log_creds_payload)
    }
    "kind" = "Secret"
    "metadata" = {
      "name" = "airbyte-airbyte-gcs-log-creds"
      "namespace" = "airbyte"
    }
  }
}
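
(For anyone applying manifests directly instead of using Terraform, a rough equivalent of the secret above would look like this, with the external_account JSON from above base64-encoded into gcp.json:)

apiVersion: v1
kind: Secret
metadata:
  name: airbyte-airbyte-gcs-log-creds
  namespace: airbyte
data:
  # base64-encoded external_account JSON shown earlier
  gcp.json: ""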

And now we are going to create the KSA (Kubernetes service account) and annotate it with our GCP service account.

Please check the automountServiceAccountToken flag: we want to mount our access token in a different location, so setting it to false is a must.

resource "kubernetes_manifest" "ksa_airbyte_admin" {
  manifest = {
    "apiVersion" = "v1"
    "automountServiceAccountToken" = false
    "kind" = "ServiceAccount"
    "metadata" = {
      "annotations" = {
        "iam.gke.io/gcp-service-account" = var.sa_airbyte
      }
      "name" = "airbyte-admin"
      "namespace" = "airbyte"
    }
  }
}
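
(Again, as a rough plain-manifest equivalent of the Terraform above; the GCP service account email is a placeholder standing in for var.sa_airbyte:)

apiVersion: v1
kind: ServiceAccount
metadata:
  name: airbyte-admin
  namespace: airbyte
  annotations:
    # GCP service account email created earlier (placeholder)
    iam.gke.io/gcp-service-account: airbyte-admin@GSA_PROJECT_ID.iam.gserviceaccount.com
# keep the default token mount off; the projected token is mounted explicitly below
automountServiceAccountToken: false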

In my values for the Helm chart:

serviceAccount:
  create: false # I don't want to create airbyte-admin with Helm, but with a Kubernetes manifest
global:
  logs:
    gcs:
      credentials: "/secrets/tokens/gcp-ksa/gcp.json" # I use a different path, explanation below
    minio:
      enabled: true

server:
  extraVolumeMounts:
    - name: gcp-ksa
      mountPath: /secrets/tokens/gcp-ksa
      readOnly: true
  extraVolumes: 
    - name: gcp-ksa
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            path: token
            audience: playground-357914.svc.id.goog
            expirationSeconds: 172800
        - secret:
            name: airbyte-airbyte-gcs-log-creds

worker:
  extraVolumeMounts:
    - name: gcp-ksa
      mountPath: /secrets/tokens/gcp-ksa
      readOnly: true
  extraVolumes: 
    - name: gcp-ksa
      projected:
        defaultMode: 420
        sources:
        - serviceAccountToken:
            path: token
            audience: playground-357914.svc.id.goog
            expirationSeconds: 172800
        - secret:
            name: airbyte-airbyte-gcs-log-creds

And here is an example of the mounted files (screenshots omitted). You can see my secret mounted twice: gcs-log-creds (this one is created by the Helm chart) and token (my override).

@userbradley If you have more questions about this implementation, feel free to ask. I will try to create a simple example in a public repo for this, because I've seen tons of threads about it.

Additional notes: I didn't test this with a GCP connector, for example BigQuery. If we could use the same method of an impersonated JSON file rather than a private key from a service account there, it would be huge :D

userbradley commented 2 years ago

@Santhin thanks for the comment, I'll try to make some time to look into it.

Thought I'd just reply so you don't think I've ignored it - the team and I greatly appreciate your input and help!

Santhin commented 2 years ago

This solution has some drawbacks, or some additional benefits, depending on how you look at it.

Since Fleet Workload Identity mounts GOOGLE_APPLICATION_CREDENTIALS into the worker pod, when you try to create a connection / destination using BigQuery you will encounter a weird error while uploading the credentials JSON.

At first I was confused about why the error mentions type external_account when I was trying to enter normal credentials with type service_account.

I connected the dots: the BigQuery connector is trying to use my GOOGLE_APPLICATION_CREDENTIALS from the worker. And here is the question: would a small rewrite inside the BigQuery connector give us the possibility to enter impersonation credentials rather than normal ones?

Doing a bit of digging, I found these:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java#L163
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/destination-bigquery/sample_secret/credentials.json

@userbradley here is the link to the issue: https://discuss.airbyte.io/t/airbyte-using-fleet-workload-identity-overwrites-google-application-credentials-inside-connector/2277

franviera92 commented 1 year ago

I need Airbyte to work with Workload Identity, please add this feature.

yuriolive commented 1 year ago

{ "type": "external_account", "audience": "identitynamespace:WORKLOAD_IDENTITY_POOL:IDENTITY_PROVIDER", "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/GSA_NAME@GSA_PROJECT_ID.iam.gserviceaccount.com:generateAccessToken", "subject_token_type": "urn:ietf:params:oauth:token-type:jwt", "token_url": "https://sts.googleapis.com/v1/token", "credential_source": { "file": "/secrets/tokens/gcp-ksa/token" <- in our example token gonna be mounted in this location screens below } }

@Santhin What should IDENTITY_PROVIDER be for a GKE cluster? I couldn't find it in the links.

Santhin commented 1 year ago

@yuriolive To retrieve the values you can use gcloud container fleet memberships describe MEMBERSHIP, where MEMBERSHIP is your cluster's unique membership name in the fleet (source).
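
For a GKE cluster registered to a fleet, the relevant part of that command's output should look roughly like this (the values below are placeholders, double-check against your own output; the audience in the JSON is then identitynamespace:WORKLOAD_IDENTITY_POOL:IDENTITY_PROVIDER):

authority:
  # for GKE clusters this is typically the cluster's URL
  identityProvider: https://container.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/clusters/CLUSTER_NAME
  workloadIdentityPool: PROJECT_ID.svc.id.goog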

yuriolive commented 1 year ago

@yuriolive To retrieve the values you can use gcloud container fleet memberships describe MEMBERSHIP, where MEMBERSHIP is your cluster's unique membership name in the fleet (source)

gcloud container fleet memberships list

The command doesn't return any memberships. Are you using GKE too? Do you have to enable Anthos? Anthos has some cost involved, so I would avoid it if I could.