CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Operator does not find role 'pgo-target-role' #2334

Closed: timbrd closed this issue 3 years ago

timbrd commented 3 years ago

Describe the bug

After successfully installing pgo 4.6.1 using the Operator Lifecycle Manager (OLM) as described in the documentation, I found the following error messages in the operator logs:

[...]
time="2021-03-14T20:05:55Z" level=error msg="operator is unable to reconcile RBAC resource: roles.rbac.authorization.k8s.io \"pgo-target-role\" not found" func="internal/controller/manager.(*ControllerManager).reconcileRoleBindings()" file="internal/controller/manager/rbac.go:112" version=4.6.1
time="2021-03-14T20:06:55Z" level=error msg="operator is unable to reconcile RBAC resource: roles.rbac.authorization.k8s.io \"pgo-target-role\" is forbidden: user \"system:serviceaccount:pgo:postgres-operator\" (groups=[\"system:serviceaccounts\" \"system:serviceaccounts:pgo\" \"system:authenticated\"]) is attempting to grant RBAC permissions not currently held:\n{APIGroups:[\"apps\"], Resources:[\"replicasets\"], Verbs:[\"get\" \"list\" \"watch\" \"create\" \"patch\" \"update\" \"delete\" \"deletecollection\"]}" func="internal/controller/manager.(*ControllerManager).reconcileRoles()" file="internal/controller/manager/rbac.go:95" version=4.6.1
[...]

The deployment is running:

[root@b59e326509a5 postgres-operator]# kubectl get deployment -n pgo
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
postgres-operator   1/1     1            1           106m

Do I have to create the role manually?
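Before deciding whether to create anything by hand, it can help to check which of the two errors above applies. Kubernetes RBAC does not let a subject create or update a role granting permissions the subject does not itself hold (unless it has the `escalate` verb on roles), and that is what the second, "forbidden" message reports. The following is a minimal shell sketch. It assumes kubectl access, the `pgo` namespace, and the `postgres-operator` service account from the log. The function name `check_pgo_target_rbac` is hypothetical, and `--as` only works if your own user is allowed to impersonate service accounts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of postgres-operator): report whether
# pgo-target-role and its binding exist, and whether the operator's service
# account already holds the replicaset verbs from the "forbidden" log line.
check_pgo_target_rbac() {
  local ns="${1:?usage: check_pgo_target_rbac <namespace> [operator-namespace]}"
  local op_ns="${2:-$ns}"
  local sa="system:serviceaccount:${op_ns}:postgres-operator"
  local status=0 verb

  if kubectl -n "$ns" get role pgo-target-role >/dev/null 2>&1; then
    echo "role pgo-target-role: present"
  else
    echo "role pgo-target-role: missing"
    status=1
  fi

  if kubectl -n "$ns" get rolebinding pgo-target-role-binding >/dev/null 2>&1; then
    echo "rolebinding pgo-target-role-binding: present"
  else
    echo "rolebinding pgo-target-role-binding: missing"
    status=1
  fi

  # RBAC escalation prevention: the service account can only grant verbs it
  # already holds. --quiet makes "auth can-i" report via exit code only.
  for verb in get list watch create patch update delete deletecollection; do
    if kubectl auth can-i "$verb" replicasets.apps -n "$ns" --as="$sa" --quiet; then
      echo "$verb replicasets.apps: yes"
    else
      echo "$verb replicasets.apps: no"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_pgo_target_rbac pgo` prints one line per check and returns non-zero if anything is missing.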

To Reproduce

Steps to reproduce the behavior:

  1. Create the OperatorGroup and OLM Subscription as described in the docs
  2. Check the operator logs: kubectl logs postgres-operator-6df4f5746c-jq8ss operator -n pgo
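The pod name in step 2 changes whenever the operator pod is recreated. The sketch below reads the logs through the deployment instead and keeps only the RBAC reconciliation errors. It assumes the deployment name `postgres-operator` and the container name `operator` shown in this report. The function name `pgo_rbac_errors` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print operator log lines that report RBAC reconciliation
# failures, without hard-coding the generated pod name.
pgo_rbac_errors() {
  local ns="${1:-pgo}"
  # kubectl resolves deployment/<name> to one of its pods; -c picks the container.
  kubectl logs "deployment/postgres-operator" -c operator -n "$ns" \
    | grep 'unable to reconcile RBAC resource'
}
```

Because `grep` exits non-zero when nothing matches, `pgo_rbac_errors pgo || echo "no RBAC errors"` also works as a quick check after applying a fix.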
jkatz commented 3 years ago

I would also suggest reviewing the Namespaces section of the documentation, which details the various namespaces and required privileges.

timbrd commented 3 years ago
> Have you or someone on your system previously installed the Postgres Operator with a different method?

I have not used a different method to install the operator before, but I had to delete the OLM InstallPlan and CRDs because I had installed them into the wrong namespace.

> Which namespace mode did you select?

I didn't select any namespace mode. I only installed the operator using the manifests from the documentation mentioned above:

kubectl -n "$PGO_OPERATOR_NAMESPACE" create -f- <<YAML
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: postgresql
spec:
  targetNamespaces: ["$PGO_OPERATOR_NAMESPACE"]

---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: postgresql
spec:
  name: postgresql
  channel: stable
  source: operatorhubio-catalog
  sourceNamespace: olm
  startingCSV: postgresoperator.v4.6.1
YAML
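Before reading the operator logs, it is worth confirming that OLM actually finished installing the ClusterServiceVersion named in `startingCSV`. The sketch below polls the CSV's `status.phase` until it reports `Succeeded`. It assumes OLM's `csv` short name is registered in the cluster. The function name, attempt count, and interval are hypothetical defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the CSV installed by the Subscription until its
# status.phase is "Succeeded", or give up after a fixed number of attempts.
wait_for_csv() {
  local ns="$1" csv="${2:-postgresoperator.v4.6.1}"
  local attempts="${3:-30}" interval="${4:-10}" phase i=1
  while [ "$i" -le "$attempts" ]; do
    phase="$(kubectl -n "$ns" get csv "$csv" -o jsonpath='{.status.phase}' 2>/dev/null)" || phase=""
    echo "attempt $i: ${phase:-<none>}"
    if [ "$phase" = "Succeeded" ]; then
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_csv "$PGO_OPERATOR_NAMESPACE"` waits up to about five minutes with the defaults above.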

> I would also suggest reviewing the Namespaces section of the documentation, which details the various namespaces and required privileges.

Thanks, I will check that. But isn't the operator supposed to prepare the namespace and create the basic roles?

jkatz commented 3 years ago

> Thanks, I will check that. But isn't the operator supposed to prepare the namespace and create the basic roles?

Yes, it does -- in fact, that is what that step is attempting to do. However, if your OpenShift cluster has certain permissions locked down, it may take a little extra effort.

Perhaps another question I should have asked first -- have you attempted to create a Postgres cluster after installing? That error may be the Operator just reporting that it does not have those permissions yet, and they are subsequently created after the reconciliation loop finishes.

aleksrosz commented 3 years ago

Hi @jkatz @timbrd, I think this is the solution.

On Linux:

mkdir -p $HOME/odev/src/github.com/crunchydata $HOME/odev/bin $HOME/odev/pkg
cd $HOME/odev/src/github.com/crunchydata
git clone https://github.com/CrunchyData/postgres-operator.git
cd postgres-operator
git checkout v4.6.1

/odev/src/github.com/crunchydata/postgres-operator/deploy/add-targeted-namespace.sh

Among other things, this file contains:

# create RBAC
$PGO_CMD -n $1 delete --ignore-not-found sa pgo-backrest pgo-default pgo-target
$PGO_CMD -n $1 delete --ignore-not-found role pgo-backrest-role pgo-target-role
$PGO_CMD -n $1 delete --ignore-not-found rolebinding pgo-backrest-role-binding pgo-target-role-binding

cat $PGO_CONF_DIR/pgo-configs/pgo-default-sa.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-target-sa.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-target-role.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-target-role-binding.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | sed 's/{{.OperatorNamespace}}/'"$PGO_OPERATOR_NAMESPACE"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-backrest-sa.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-backrest-role.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-backrest-role-binding.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
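The excerpt relies on three environment variables (`PGO_CMD`, `PGO_CONF_DIR`, `PGO_OPERATOR_NAMESPACE`) and takes the target namespace as its first argument. It deletes the service accounts, roles, and bindings with `--ignore-not-found` before recreating them, so rerunning it should be safe. The following is a minimal wrapper sketch that refuses to run until those inputs are set. The function name is hypothetical. Any other variables the full script may read are not checked here, and where `pgo-configs` lives in a v4.6.1 checkout should be confirmed against your clone.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around deploy/add-targeted-namespace.sh: validate the
# inputs its RBAC section uses, then run it for one target namespace.
run_add_targeted_namespace() {
  local script="$1" target_ns="$2" var val
  if [ -z "$target_ns" ]; then
    echo "usage: run_add_targeted_namespace <script> <target-namespace>" >&2
    return 2
  fi
  for var in PGO_CMD PGO_CONF_DIR PGO_OPERATOR_NAMESPACE; do
    eval "val=\${$var:-}"   # indirect lookup of the variable named by $var
    if [ -z "$val" ]; then
      echo "error: $var is not set" >&2
      return 1
    fi
  done
  if [ ! -d "$PGO_CONF_DIR/pgo-configs" ]; then
    echo "error: $PGO_CONF_DIR/pgo-configs not found" >&2
    return 1
  fi
  # Pass the variables explicitly so they reach the script even if not exported;
  # the script reads the target namespace from $1.
  PGO_CMD="$PGO_CMD" PGO_CONF_DIR="$PGO_CONF_DIR" \
    PGO_OPERATOR_NAMESPACE="$PGO_OPERATOR_NAMESPACE" \
    bash "$script" "$target_ns"
}
```

For example, with `PGO_CMD=kubectl` and `PGO_OPERATOR_NAMESPACE=pgo` set, `run_add_targeted_namespace deploy/add-targeted-namespace.sh pgo` targets the operator's own namespace. This matches the OperatorGroup used in this issue.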
aceeric commented 3 years ago

I am also experiencing this, in my case with a fully declarative deployment: 1) subscribe to the 4.6.1 operator from OLM, 2) create a 4.6.1 pgcluster custom resource. I get exactly the same error in the logs regarding pgo-target-role. @jkatz, the answer above seems to indicate that running a shell script is required as part of the install, but our install needs to be fully declarative. Do you think the fully declarative approach I've described should be creating pgo-target-role? It appears not to be...

I should add that we have been successfully using 4.5.1 up until now. The only changes in our configuration are the version changes from 4.5.1 to 4.6.1 in the OLM subscription manifest and the pgcluster manifest...

aceeric commented 3 years ago

@jkatz, as a follow-up: after our fully declarative deployment, I picked out and ran exactly two statements from @AleksanderRoszig's comment:

cat $PGO_CONF_DIR/pgo-configs/pgo-target-role.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | $PGO_CMD -n $1 create -f -
cat $PGO_CONF_DIR/pgo-configs/pgo-target-role-binding.json | sed 's/{{.TargetNamespace}}/'"$1"'/' | sed 's/{{.OperatorNamespace}}/'"$PGO_OPERATOR_NAMESPACE"'/' | $PGO_CMD -n $1 create -f -

... and the postgres operator errors immediately stopped. This would seem to indicate an error in the OLM subscription - the OLM CSV does not appear to define all roles and bindings needed by the operator.
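In those two statements, `$1` is the target-namespace argument of the original script, so running them by hand means substituting a real namespace. For a GitOps workflow like the one described here, one option is to render the two templates once and commit the output. The role and binding can then be applied declaratively alongside the other manifests. The following is a minimal sketch. It assumes the `pgo-configs` template files from the v4.6.1 source and the same `{{.TargetNamespace}}`/`{{.OperatorNamespace}}` placeholders the script substitutes. The function name and output layout are hypothetical. Unlike the original `sed` calls, it uses the `g` flag so that every placeholder on a line is replaced, and it escapes the dots so they match literally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render the pgo-target role and binding templates for one
# namespace into a directory that can be committed and applied with kubectl.
render_pgo_target_rbac() {
  local conf_dir="$1" target_ns="$2" operator_ns="$3" out_dir="$4" name
  if [ -z "$conf_dir" ] || [ -z "$target_ns" ] || [ -z "$operator_ns" ] || [ -z "$out_dir" ]; then
    echo "usage: render_pgo_target_rbac <conf-dir> <target-ns> <operator-ns> <out-dir>" >&2
    return 2
  fi
  mkdir -p "$out_dir" || return 1
  for name in pgo-target-role pgo-target-role-binding; do
    # Replace every placeholder occurrence; "\." matches a literal dot.
    sed -e "s/{{\.TargetNamespace}}/${target_ns}/g" \
        -e "s/{{\.OperatorNamespace}}/${operator_ns}/g" \
        "$conf_dir/pgo-configs/${name}.json" > "$out_dir/${name}.json" || return 1
  done
}
```

For example, `render_pgo_target_rbac "$PGO_CONF_DIR" pgo pgo ./rbac` writes two JSON manifests into `./rbac`, which can be committed and then applied with `kubectl apply -f ./rbac`.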

jkatz commented 3 years ago

@aceeric Can you please provide a bit more information:

> Our install needs to be fully declarative.

I'd also be curious if you could elaborate on the reason why it needs to be fully declarative. What is the use case you are trying to solve?

jkatz commented 3 years ago

Nothing in the diff between v4.5.1 and v4.6.1 immediately strikes me as a bug, though I also don't see anything that convinces me it is not one. I'll treat it as a bug for now and see whether there is a programmatic or OLM-based fix.

aceeric commented 3 years ago

@jkatz: our environment is an air-gapped Kubernetes cluster, so we run our own OLM registry serving the operator manifests cloned from https://github.com/operator-framework/community-operators/tree/master/upstream-community-operators/postgresql. We use the ubi8-4.6.1 operator image with these manifests.

This approach has been solid since 4.4.1, so the issue appeared with the move from 4.5.1 to 4.6.1 (we skipped 4.6.0 for no particular reason). However, I believe it could be reproduced in a non-air-gapped environment using the same manifests. Regarding the role in question: we always install into an empty namespace.

And finally - our install needs to be fully declarative because it is deployed directly from source control using GitOps tooling.

I can see that the pgo-backrest-role and pgo-pg-role roles (and bindings) are created by the 4.6.1 operator, but not the pgo-target-role and associated binding...

jkatz commented 3 years ago

@aceeric Thanks for the additional info. That should help us drill into what's going on.

jkatz commented 3 years ago

@timbrd @AleksanderRoszig @aceeric This is indeed a bug and the fix will be applied to 4.6.2 in the coming days.

For now, the workaround is the one suggested above: manually run the add-targeted-namespace.sh script.

Thanks for reporting!