airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Unable to create pods after upgrade to 0.63.11 due to missing role/binding #42859

Open NAjustin opened 1 month ago

NAjustin commented 1 month ago

Helm Chart Version

0.350.0

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

New App version: 0.63.11 Prior App Version: 0.63.9 Platform: GKE (Autopilot cluster)

Everything upgraded fine, but when trying to check or sync connections, we started getting errors like Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:airbyte-ns:REDACTED" cannot list resource "pods" in API group "" in the namespace "airbyte-ns", Guest attributes endpoint access is disabled, and "403 Forbidden" for request "PUT http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/guestInventory/Hostname"
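One way to confirm that the service account itself lacks the permission (rather than something GKE-specific) is to ask the API server directly. The sketch below is not from the original report. It assumes `kubectl` is pointed at the affected cluster and that your own user is allowed to impersonate service accounts. The helper name `check_pod_access` is hypothetical.

```shell
# Hypothetical helper: report whether a service account may perform each pod verb.
# Usage: check_pod_access <namespace> <service-account>
check_pod_access() {
  ns="$1"
  sa="$2"
  for verb in get list watch create delete; do
    # `kubectl auth can-i` prints "yes" or "no"; --as impersonates the service account.
    # It exits non-zero on "no", so the failure is swallowed to keep the loop going.
    answer="$(kubectl auth can-i "$verb" pods \
      --as="system:serviceaccount:${ns}:${sa}" -n "$ns" 2>/dev/null)" || true
    printf '%s pods: %s\n' "$verb" "${answer:-unknown}"
  done
}
```

For example, `check_pod_access airbyte-ns <service-account-name>` would be expected to show "no" for `list` on a cluster hitting the error above.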

It seems similar to some past threads:

I do already have all these set in config:

global:
  serviceAccountName: REDACTED

serviceAccount:
  create: true
  name: REDACTED

It seems like there may be a missing role binding or something along those lines. For what it's worth, we're using GKE Autopilot and a non-default service account (meaning not the default one provisioned for the cluster, and also not named airbyte-admin).

As a workaround, I granted our SA roles/container.clusterAdmin—but it really shouldn't need these permissions to create pods in its own deployments.

I saw another user reported a similar issue in this Slack thread.

Relevant log output

No response

ryanschwartz commented 1 month ago

+1 This happened for me as well, both going from 0.50.54 to 0.63.11 and also going from 0.63.8 to 0.63.11.

Role and RoleBinding were both deleted and not recreated. Upgrade --debug output:

client.go:486: [debug] Starting delete for "airbyte-admin" ServiceAccount
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "airbyte-admin-binding" RoleBinding
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "airbyte-admin-role" Role
client.go:142: [debug] creating 1 resource(s)

The --debug output did include the airbyte/templates/serviceaccount.yaml template output for both the Role and RoleBinding, so I was able to kubectl create the resources from that, but upgrade seems broken.
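Instead of copying the manifests out of the `--debug` output, the same template can be rendered on its own and piped to `kubectl`. This is only a sketch of that manual recovery, not an official fix. It assumes the chart is available as `airbyte/airbyte` in your Helm repos and that the template path inside the chart is `templates/serviceaccount.yaml`. The helper name and its arguments are hypothetical.

```shell
# Hypothetical helper: re-render the chart's RBAC template and apply it by hand.
# Usage: recreate_airbyte_rbac <release> <namespace> <chart-version> <values-file>
recreate_airbyte_rbac() {
  release="$1"
  ns="$2"
  version="$3"
  values="$4"
  # --show-only limits the rendered output to a single template file.
  helm template "$release" airbyte/airbyte \
    --namespace "$ns" --version "$version" -f "$values" \
    --show-only templates/serviceaccount.yaml |
    kubectl apply -n "$ns" -f -
}
```

Because that template also contains the ServiceAccount, `kubectl apply` will touch it too. Review the rendered output first if that matters in your setup.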

I'm running on GKE, helm version: version.BuildInfo{Version:"v3.13.1", GitCommit:"3547a4b5bf5edb5478ce352e18858d8a552a4110", GitTreeState:"clean", GoVersion:"go1.21.3"}

helm search repo airbyte/airbyte:


airbyte/airbyte                     0.363.0         0.63.11     Helm chart to deploy airbyte

NAjustin commented 1 month ago

I'm guessing this is the offending change, since it now runs pre-upgrade but before was only pre-install: https://github.com/airbytehq/airbyte-platform/compare/v0.63.10...v0.63.11#diff-d0874cce592344af301414d17a2b74f107d9291a26f0205749fc8ac218ae2457

. . . but I'm seeing the same output as @ryanschwartz which shows the delete and create of 3 resources, but only the ServiceAccount object actually gets created (not the Role and RoleBinding)

marcosmarxm commented 1 month ago

@airbytehq/platform-deployments can someone take a look into this issue?

perangel commented 1 month ago

@NAjustin Thanks for reporting this. I'm going to test the fix and hopefully have something to get out shortly

pmossman commented 1 month ago

We're still investigating a proper fix, but as a potential workaround, running kubectl rollout restart deployment <release-name>-worker -n <namespace> after the problematic helm upgrade may get things working again by forcing a new worker pod to spin up with the recreated service account.

@ryanschwartz and @NAjustin if you give that a try and it gets things working again, do let us know as that will help us work out a proper solution!

ryanschwartz commented 1 month ago

@pmossman that will only restart the worker pod - manual intervention was needed to recreate the Role and RoleBinding for me, at which point the worker began functioning as expected.
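To tell whether a given cluster is in this state before restarting anything, a small check for the two objects can help. This is a sketch that assumes the default object names from the debug output above (`airbyte-admin-role` and `airbyte-admin-binding`). The helper name is hypothetical.

```shell
# Hypothetical helper: list which of the chart's RBAC objects are missing.
# Usage: find_missing_rbac <namespace>
find_missing_rbac() {
  ns="$1"
  for res in role/airbyte-admin-role rolebinding/airbyte-admin-binding; do
    # `kubectl get` exits non-zero when the object does not exist.
    if ! kubectl get "$res" -n "$ns" >/dev/null 2>&1; then
      echo "missing: $res"
    fi
  done
}
```

If it prints nothing, both objects exist and a worker restart alone may be enough. Otherwise the missing objects need to be recreated first.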

KTamas commented 1 week ago

Baffled and disappointed how this is not a higher priority. This happened to us as well when upgrading and completely broke things. Had to fix it manually, the way @ryanschwartz suggested.

KTamas commented 1 week ago

For those who come after us, this is the YAML I applied by hand; adjust values as needed:

# Source: airbyte/templates/serviceaccount.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-admin-role
  namespace: airbyte
  labels:
    helm.sh/chart: airbyte-0.551.0
    app.kubernetes.io/name: airbyte
    app.kubernetes.io/instance: airbyte
    app.kubernetes.io/version: "0.64.3"
    app.kubernetes.io/managed-by: Helm
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-5"
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # over-permission for now
---
# Source: airbyte/templates/serviceaccount.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-binding
  namespace: airbyte
  labels:
    helm.sh/chart: airbyte-0.551.0
    app.kubernetes.io/name: airbyte
    app.kubernetes.io/instance: airbyte
    app.kubernetes.io/version: "0.64.3"
    app.kubernetes.io/managed-by: Helm
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-3"
roleRef:
  apiGroup: ""
  kind: Role
  name: airbyte-admin-role
subjects:
  - kind: ServiceAccount
    name: airbyte-admin

justbeez commented 1 week ago

For what it's worth, @bgroff posted this comment earlier today in Slack:

We have rolled out a change that should help with the role binding issue at the end of last week. We have another change that we will be landing in the next few days to make this work better.