Unable to create pods after upgrade to 0.63.11 due to missing role/binding

NAjustin commented 3 months ago

Helm Chart Version

0.350.0

What step the error happened?

Upgrading the Platform or Helm Chart

Relevant information

New App version: 0.63.11 Prior App Version: 0.63.9 Platform: GKE (Autopilot cluster)

Everything upgraded fine, but when trying to check or sync connections, we started getting errors like these: Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:airbyte-ns:REDACTED" cannot list resource "pods" in API group "" in the namespace "airbyte-ns"; Guest attributes endpoint access is disabled; and "403 Forbidden" for request "PUT http://metadata.google.internal/computeMetadata/v1/instance/guest-attributes/guestInventory/Hostname".

It seems similar to some past threads:

I do already have all these set in config:

global:
  serviceAccountName: REDACTED

serviceAccount:
  create: true
  name: REDACTED

It seems like there may be a missing role binding or something along those lines. For what it's worth, we're using GKE Autopilot and a non-default service account (meaning not the default one provisioned for the cluster, and also not named airbyte-admin).

As a workaround, I granted our SA roles/container.clusterAdmin—but it really shouldn't need these permissions to create pods in its own deployments.
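A quick way to confirm that the problem is namespace-level RBAC (rather than something GKE-specific) is to ask the API server directly whether the service account may list pods. The sketch below is only an illustration: the function names are hypothetical, the service account name is a placeholder for the redacted one, and `kubectl auth can-i --as` requires that your own user is allowed to impersonate service accounts.

```shell
#!/usr/bin/env bash

# Build the fully qualified username Kubernetes uses for a service account,
# in the same form that appears in the Forbidden error above.
sa_subject() {
  local namespace="$1" name="$2"
  printf 'system:serviceaccount:%s:%s\n' "$namespace" "$name"
}

# Ask the API server whether the service account may list pods in its namespace.
# Prints "yes" or "no"; requires impersonation rights for the calling user.
can_list_pods() {
  local namespace="$1" name="$2"
  kubectl auth can-i list pods \
    --namespace "$namespace" \
    --as "$(sa_subject "$namespace" "$name")"
}

# Example with placeholder names:
#   can_list_pods airbyte-ns my-airbyte-sa
```

If this prints "no" without the cluster-admin grant, the Role or RoleBinding for the service account is most likely missing.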

I saw another user reported a similar issue in this Slack thread.

Relevant log output

No response

ryanschwartz commented 3 months ago

+1 This happened for me as well, both going from 0.50.54 to 0.63.11 and also going from 0.63.8 to 0.63.11.

Role and RoleBinding were both deleted and not recreated. Upgrade --debug output:

client.go:486: [debug] Starting delete for "airbyte-admin" ServiceAccount
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "airbyte-admin-binding" RoleBinding
client.go:142: [debug] creating 1 resource(s)
client.go:486: [debug] Starting delete for "airbyte-admin-role" Role
client.go:142: [debug] creating 1 resource(s)

The --debug output did include the airbyte/templates/serviceaccount.yaml template output for both the Role and RoleBinding, so I was able to kubectl create the resources from that, but upgrade seems broken.
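For anyone wanting to repeat this without copying from the --debug output, here is a rough sketch of the same idea: check which RBAC objects are missing, then re-render only the service account template and apply it. The function names are hypothetical, the object names are taken from the debug output above, and the `templates/serviceaccount.yaml` path and `airbyte/airbyte` repo alias are assumptions based on this thread; verify them against your chart version before running.

```shell
#!/usr/bin/env bash

# Print each expected RBAC object that is missing from the given namespace.
# Object names come from the debug output in this thread; adjust if yours differ.
missing_rbac() {
  local namespace="$1" resource
  for resource in role/airbyte-admin-role rolebinding/airbyte-admin-binding; do
    kubectl get "$resource" --namespace "$namespace" >/dev/null 2>&1 || echo "$resource"
  done
}

# Re-render only the service account template and apply the result.
# Arguments after the first three are passed through to helm (e.g. -f values.yaml).
recreate_rbac() {
  local release="$1" namespace="$2" chart_version="$3"
  shift 3
  helm template "$release" airbyte/airbyte \
    --version "$chart_version" \
    --namespace "$namespace" \
    --show-only templates/serviceaccount.yaml "$@" |
    kubectl apply --namespace "$namespace" -f -
}

# Example with placeholder values:
#   missing_rbac airbyte
#   recreate_rbac airbyte airbyte 0.363.0 -f values.yaml
```

Pass the same values file you use for the upgrade, otherwise the rendered service account name may not match your configuration.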

I'm running on GKE, helm version: version.BuildInfo{Version:"v3.13.1", GitCommit:"3547a4b5bf5edb5478ce352e18858d8a552a4110", GitTreeState:"clean", GoVersion:"go1.21.3"}

helm search repo airbyte/airbyte:


airbyte/airbyte                     0.363.0         0.63.11     Helm chart to deploy airbyte

NAjustin commented 3 months ago

I'm guessing this is the offending change, since the hook now runs on pre-upgrade where before it ran only on pre-install: https://github.com/airbytehq/airbyte-platform/compare/v0.63.10...v0.63.11#diff-d0874cce592344af301414d17a2b74f107d9291a26f0205749fc8ac218ae2457

. . . but I'm seeing the same output as @ryanschwartz which shows the delete and create of 3 resources, but only the ServiceAccount object actually gets created (not the Role and RoleBinding)
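To see which hook phases the installed release actually carries for these objects, something like the following may help. It is a minimal sketch with a hypothetical function name; it relies only on `helm get hooks`, and the grep pattern assumes the annotations are rendered the way they appear in the chart output quoted in this thread.

```shell
#!/usr/bin/env bash

# Show the kind and helm.sh/hook phase of every hook stored for a release.
# The pattern skips helm.sh/hook-weight and helm.sh/hook-delete-policy lines.
show_hook_phases() {
  local release="$1" namespace="$2"
  helm get hooks "$release" --namespace "$namespace" |
    grep -E '^kind:|helm\.sh/hook"?:'
}

# Example with placeholder names:
#   show_hook_phases airbyte airbyte-ns
```

Comparing the output before and after the upgrade should show whether the Role and RoleBinding changed from pre-install to pre-upgrade hooks.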

marcosmarxm commented 3 months ago

@airbytehq/platform-deployments can someone take a look into this issue?

perangel commented 3 months ago

@NAjustin Thanks for reporting this. I'm going to test the fix and hopefully have something to get out shortly.

pmossman commented 3 months ago

We're still investigating a proper fix, but as a potential workaround, running kubectl rollout restart deployment <release-name>-worker -n <namespace> after the problematic helm upgrade may get things working again by forcing a new worker pod to spin up with the recreated service account.

@ryanschwartz and @NAjustin if you give that a try and it gets things working again, do let us know as that will help us work out a proper solution!

ryanschwartz commented 3 months ago

@pmossman that will only restart the worker pod - manual intervention was needed to recreate the Role and RoleBinding for me, at which point the worker began functioning as expected.

KTamas commented 2 months ago

Baffled and disappointed how this is not a higher priority. This happened to us as well when upgrading and completely broke things. Had to fix it manually, the way @ryanschwartz suggested.

KTamas commented 2 months ago

For those who come after us, this is the yaml I've applied by hand, adjust values as needed:

# Source: airbyte/templates/serviceaccount.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-admin-role
  namespace: airbyte
  labels:
    helm.sh/chart: airbyte-0.551.0
    app.kubernetes.io/name: airbyte
    app.kubernetes.io/instance: airbyte
    app.kubernetes.io/version: "0.64.3"
    app.kubernetes.io/managed-by: Helm
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-5"
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # over-permission for now
---
# Source: airbyte/templates/serviceaccount.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-binding
  namespace: airbyte
  labels:
    helm.sh/chart: airbyte-0.551.0
    app.kubernetes.io/name: airbyte
    app.kubernetes.io/instance: airbyte
    app.kubernetes.io/version: "0.64.3"
    app.kubernetes.io/managed-by: Helm
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-weight: "-3"
roleRef:
  apiGroup: ""
  kind: Role
  name: airbyte-admin-role
subjects:
  - kind: ServiceAccount
    name: airbyte-admin

justbeez commented 2 months ago

For what it's worth, @bgroff posted this comment earlier today in Slack:

We have rolled out a change that should help with the role binding issue at the end of last week. We have another change that we will be landing in the next few days to make this work better.

dimisjim commented 1 month ago

This is still an issue at v0.64.7 app version (helm chart version: 0.654.0) (the last before v1)

Fixed with @KTamas' extra resources

gavin-ob commented 3 weeks ago

This is still an issue at v1.1.0 app version (helm chart version: 1.1.0)

Fixed with @KTamas extra resources.

For those unfamiliar with applying changes like that, you need to:

1. Copy the text from the suggestion above into a file, e.g. sa-role-update.yaml.
2. Update the relevant parts. For me this was the app and chart versions in both sections (1.1.0 for me), and the namespace, which I set to default because that is where mine is running.
3. Apply it to your cluster using kubectl, e.g. kubectl apply -n default -f modules/helm/airbyte_temp/sa-roles.yml

Some people fear dying, others fear living... I fear airbyte upgrades

justbeez commented 3 weeks ago

@perangel Any update on this from the Airbyte side?

(I think a few of the issues people are running into in the Community Slack are also related, but they just don't know k8s well enough to articulate the problem.)

@gavin-ob Ouch (and here I told @marcosmarxm it was getting better 😂):

Some people fear dying, others fear living... I fear airbyte upgrades

zubairov commented 3 days ago

Same issue with the upgrade to 1.1.1. Thanks to @gavin-ob and @KTamas for describing the suggestion! When modifying the suggestion, the labels and annotations can be deleted completely; they are not relevant to whether the fix works.
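A trimmed-down version of the snippet, without labels and annotations, could be generated like this. It is a sketch under a few assumptions: the function name is hypothetical, the Role and RoleBinding names are the ones from @KTamas' snippet, `roleRef.apiGroup` is written out as rbac.authorization.k8s.io instead of the empty string, and the subject namespace is stated explicitly.

```shell
#!/usr/bin/env bash

# Emit a minimal Role and RoleBinding (no labels or annotations) that lets
# the given service account manage jobs, pods, and secrets in its namespace.
render_rbac() {
  local namespace="$1" service_account="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airbyte-admin-role
  namespace: ${namespace}
rules:
  - apiGroups: ["*"]
    resources: ["jobs", "pods", "pods/log", "pods/exec", "pods/attach", "secrets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: airbyte-admin-binding
  namespace: ${namespace}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: airbyte-admin-role
subjects:
  - kind: ServiceAccount
    name: ${service_account}
    namespace: ${namespace}
EOF
}

# Example with placeholder names:
#   render_rbac airbyte airbyte-admin | kubectl apply -f -
```

Because these objects are applied outside Helm, a later chart upgrade may delete or replace them again, so it is worth re-checking after each upgrade.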

KTamas commented 2 days ago

It's been over two months since I posted that snippet, and this is getting really embarrassing for Airbyte, in my opinion, anyways.
