This looks like a networking issue.
2020-03-24 18:19:49.156 UTC [194] DETAIL: Role "ccp_monitoring" does not exist. Connection matched pg_hba.conf line 7: "host all all 0.0.0.0/0 md5"
Authentication is failing for the ccp_monitoring user because it did not initially exist. Once it is created, there appears to be a different error loading some of its JSON. I recently tested this off of master and things appear to load correctly; I'll check on REL_4_2 at some point.
For the other error:
[255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]
The authentication failure is why it cannot find the primary cluster.
Please try creating the cluster again.
@jkatz thanks for the speedy response! I have tried a few times already, but no success. Have you seen the logs for the database container? The request from pgBackRest does reach the pod, which is why I ruled out a networking problem.
I see this:
2020-03-24 18:21:53.280 UTC [191] DETAIL: The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001" ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
which is not pgBackRest reaching the PostgreSQL Pod, but rather PostgreSQL trying to execute the WAL archive command and not being able to reach the pgBackRest pod.
Does your environment block SSH by chance?
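If it helps to check, a quick test along these lines (the pod name is a placeholder; the repo service name comes from the log above) should tell whether the database container can even reach the repository's sshd on 2022:
# Sketch only: replace the namespace and pod name with the actual PostgreSQL pod.
kubectl -n <namespace> exec -it hippo-xxxxxxxxxx-xxxxx -c database -- \
  ssh -p 2022 -o StrictHostKeyChecking=no hippo-backrest-shared-repo hostname
A timeout or "no route to host" points at the network; an authentication error at least means the port is reachable.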
I don't think so. This is a staging, self-managed cluster with Cilium as the CNI, but I haven't configured any network policies for it.
If possible, I would recommend disabling Cilium and then trying to deploy a cluster to see if it works.
Having not heard back, I believe this is an issue with the Cilium configuration. If you are able to investigate more and report back, please let us know. In the interim I am closing this.
Same issue, same error. Also 4.2.2.
I use Network Policies managed by Calico, but I have fully enabled ingress and egress traffic inside the cluster namespace. I also don't think this is a network problem; it seems to be something with SSH.
After reading https://github.com/pgbackrest/pgbackrest/issues/895 I'm under the impression that one of the containers expects a password for SSH and the lack of one is causing the problem.
Below is what I'm doing. The stanza creation crashes every time...
Logs:
time="2020-05-07T00:50:32Z" level=info msg="pgo-backrest starts"
time="2020-05-07T00:50:32Z" level=info msg="debug flag set to false"
time="2020-05-07T00:50:32Z" level=info msg="backrest stanza-create command requested"
time="2020-05-07T00:50:32Z" level=info msg="command to execute is [pgbackrest stanza-create --db-host=10.3.0.59 --db-path=/pgdata/website-cluster]"
time="2020-05-07T00:50:32Z" level=info msg="command is pgbackrest stanza-create --db-host=10.3.0.59 --db-path=/pgdata/website-cluster "
time="2020-05-07T00:50:34Z" level=error msg="command terminated with exit code 56"
time="2020-05-07T00:50:34Z" level=info msg="output=[]"
time="2020-05-07T00:50:34Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.3.0.59' terminated unexpectedly [255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
time="2020-05-07T00:50:34Z" level=error msg="command terminated with exit code 56"
If it helps, here is my pgo.yaml:
Cluster:
  PrimaryNodeLabel: part-of=default
  ReplicaNodeLabel: part-of=default
  CCPImagePrefix: crunchydata
  Metrics: true
  Badger: false
  CCPImageTag: centos7-12.2-4.2.2
  Port: 5432
  PGBadgerPort: 10000
  ExporterPort: 9187
  User: postgres
  Database: website
  PasswordAgeDays:
  PasswordLength: 24
  Policies:
  Strategy: 1
  Replicas: 1
  ArchiveMode: false
  ArchiveTimeout: 60
  ServiceType: ClusterIP
  Backrest: true
  BackrestPort: 2022
  BackrestS3Bucket:
  BackrestS3Endpoint:
  BackrestS3Region:
  DisableAutofail: false
  LogStatement: none
  LogMinDurationStatement: 60000
  PodAntiAffinity: preferred
  SyncReplication: true
PrimaryStorage: custom
BackupStorage: gce
ReplicaStorage: gce
BackrestStorage: gce
Storage:
  custom:
    AccessMode: ReadWriteOnce
    Fsgroup: 26
    MatchLabels:
    Size: 25G
    StorageClass: persistent-pd
    StorageType: dynamic
    SupplementalGroups:
  gce:
    AccessMode: ReadWriteOnce
    Size: 25G
    StorageType: dynamic
    StorageClass: standard
    Fsgroup: 26
DefaultContainerResources: custom
DefaultLoadResources:
DefaultLspvcResource:
DefaultRmdataResources:
DefaultBackupResources:
DefaultPgbouncerResources:
ContainerResources:
  custom:
    RequestsMemory: 256Mi
    RequestsCPU: 0.1
    LimitsMemory: 1Gi
    LimitsCPU: 0.5
Pgo:
  PreferredFailoverNode:
  Audit: false
  PGOImagePrefix: crunchydata
  PGOImageTag: centos7-4.2.2
I then issue these commands via a Bash script, edited for brevity:
# Create
pgo create cluster my-cluster --namespace=cluster-ns
# Wait for completion
kubectl -n cluster-ns rollout status -w my-cluster
# Enable autofail
echo yes | pgo update cluster my-cluster --enable-autofail --namespace=cluster-ns
# Enable backups
pgo create schedule my-cluster --schedule="0 4 * * *" --schedule-type=pgbackrest --pgbackrest-backup-type=full --schedule-opts="--repo1-retention-full=7" --namespace=cluster-ns
The containers are using passwordless SSH keys. This sounds like something with your Calico configuration.
The telltale error is this:
[056]: unable to find primary cluster - cannot proceed
The reason for the "authentication failed" error is because the stanza create job cannot connect to the pgBackRest repository.
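As an aside, if you want to verify the key material is in place: in my deployment each cluster gets a per-cluster secret holding the SSH keys and configuration that both pods mount. The naming pattern below is an assumption based on my install and may differ:
# Assumed naming pattern: <cluster>-backrest-repo-config
kubectl -n cluster-ns describe secret website-cluster-backrest-repo-config
If the secret exists and both pods mount it, missing keys are unlikely to be the cause.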
Hello @jkatz!
OK, I'll debug the Network Policies.
In the meantime, the stanza Pod is trying to connect via port 2022 to the IP of the PostgreSQL master Pod, which has some containers inside. As far as I can see, no container declares this port, though. When I describe the pod, I get ports 5432 and 8009 (both TCP) for the database container and port 9187 (TCP) for the collect container.
Maybe my configuration is missing something? Where would the SSH connection go?
Port 2022 exists both for the pgBackRest repository to connect to a PostgreSQL pod and for the PostgreSQL pod to push archives to the pgBackRest repository. For example, in my cluster called hippo:
PostgreSQL
kubectl -n pgo describe service hippo
yields
Name: hippo
Namespace: pgo
Labels: name=hippo
pg-cluster=hippo
vendor=crunchydata
Annotations: <none>
Selector: pg-cluster=hippo,role=master
Type: ClusterIP
IP: 10.96.145.62
Port: pgbadger 10000/TCP
TargetPort: 10000/TCP
Endpoints: 10.44.0.3:10000
Port: postgres-exporter 9187/TCP
TargetPort: 9187/TCP
Endpoints: 10.44.0.3:9187
Port: sshd 2022/TCP
TargetPort: 2022/TCP
Endpoints: 10.44.0.3:2022
Port: patroni 8009/TCP
TargetPort: 8009/TCP
Endpoints: 10.44.0.3:8009
Port: postgres 5432/TCP
TargetPort: 5432/TCP
Endpoints: 10.44.0.3:5432
Session Affinity: None
Events: <none>
pgBackRest Repository
kubectl -n pgo describe service hippo-backrest-shared-repo
yields
Name: hippo-backrest-shared-repo
Namespace: pgo
Labels: name=hippo-backrest-shared-repo
pg-cluster=hippo
pgo-backrest-repo=true
vendor=crunchydata
Annotations: <none>
Selector: name=hippo-backrest-shared-repo
Type: ClusterIP
IP: 10.96.168.67
Port: <unset> 2022/TCP
TargetPort: 2022/TCP
Endpoints: 10.44.0.2:2022
Session Affinity: None
Events: <none>
I just updated pgo to 4.3.0, but to no avail.
I can confirm the backrest pod declares port 2022, but it's still absent from the postgresql pod...
Going over cluster-deployment.json I see there is a volume for sshd, but no port.
@jkatz could you send the descriptions of all the hippo pods? The services I create are like the ones you posted.
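(For reference, all of those descriptions can be pulled at once with a label selector taken from the service output above:)
kubectl -n pgo describe pods -l pg-cluster=hippo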
After thinking about why the CNI causes the issue: maybe it simply enforces that Pods must declare each port they use?
If this assumption is correct, it would indeed be a network problem, since Calico would block traffic to a port it does not recognize... The stanza pod connects directly to the Pod IP, so traffic sent straight to an undeclared port would be dropped.
That's the only explanation I can think of. Another consequence would be that the other ports in the service wouldn't receive traffic either.
In GKE, I can confirm that without Network Policies or Pod Security Policies everything works...
I'll try to dig deeper into what's going on with Calico or if it's some PSP misconfiguration and post it here.
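For anyone who wants to test the declared-ports theory directly, a sketch of a policy that explicitly allows the pgBackRest SSH port between pods in the namespace might look like this (the name is mine; the namespace is the one from the script above):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pgbackrest-ssh      # illustrative name
  namespace: cluster-ns           # assumption: the namespace used in the script above
spec:
  podSelector: {}                 # applies to every pod in the namespace
  ingress:
  - from:
    - podSelector: {}             # allow from any pod in the same namespace
    ports:
    - protocol: TCP
      port: 2022                  # pgBackRest sshd; 5432/8009/9187/10000 may need the same treatment
If the stanza job succeeds with this in place, the CNI really was dropping the direct pod-to-pod traffic on 2022.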
@jkatz I found the source of the error. It's not related to Calico (although you are right that a correct network policy must be in place); it's related to the Pod Security Policy and the lack of some permissions for the backrest service account.
In my cluster, this is the PSP that gets assigned to the job. Can you help me with information about which option is missing? It should be related to the network, I guess...
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default,runtime/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: runtime/default
  name: zz-default
spec:
  allowPrivilegeEscalation: false
  fsGroup:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  requiredDropCapabilities:
  - ALL
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim
@davi5e Try setting your fsGroup to 26:
fsGroup:
  ranges:
  - max: 26
    min: 26
I'm not sure if you're using any supplemental groups; you may not need that section. But the "postgres" process expects the files to be owned by UID 26, hence the fsGroup setting.
This may also help: https://github.com/CrunchyData/postgres-operator/blob/master/deploy/pgo-scc.yaml
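For a bit more context on why 26 specifically: the PostgreSQL Pods the operator creates carry a Pod-level security context roughly like the sketch below (not the literal deployment spec), which is what the PSP's fsGroup range has to permit:
securityContext:
  fsGroup: 26        # postgres runs as UID/GID 26, so mounted volumes need group 26 access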
Thank you @jkatz!
This file, after converting from OpenShift to a PSP, really worked.
If anyone ever reads this: I changed the user to RunAsAny and created a RoleBinding for every service account PGO creates (one in the operator namespace, four in each observed namespace), all pointing to the modified PSP YAML above (spec here for v1.15). For me, this is secure enough.
Maybe this should go into the documentation somewhere? I read this link about security, but the Pod Security Policy is absent...
@davi5e Glad to hear it's working. We probably should augment the security documentation. Patches and contributions are certainly welcome :wink:
I am facing a similar issue with pgo 4.4.0. I can't seem to figure out what's being blocked; port 2022 seems to be open, but how do I confirm?
pgo create cluster bpg2 -n pgo
kubectl get pods -npgo
NAME READY STATUS RESTARTS AGE
bpg2-backrest-shared-repo-6947dd5f4-xzj9m 1/1 Running 0 65m
bpg2-c7d459fc7-pmklb 1/1 Running 1 64m
bpg2-stanza-create-2glzl 0/1 Error 0 64m
bpg2-stanza-create-9zfzr 0/1 Error 0 63m
bpg2-stanza-create-gbbrm 0/1 Error 0 63m
bpg2-stanza-create-mk5ww 0/1 Error 0 64m
bpg2-stanza-create-s4cbh 0/1 Error 0 62m
pgo-deploy-p29sc 0/1 Completed 0 5h37m
postgres-operator-5d5ff486c7-6pcjm 4/4 Running 0 5h36m
Job Log:
time="2020-07-24T23:21:19Z" level=info msg="pgo-backrest starts"
time="2020-07-24T23:21:19Z" level=info msg="debug flag set to false"
time="2020-07-24T23:21:19Z" level=info msg="backrest stanza-create command requested"
time="2020-07-24T23:21:19Z" level=info msg="command to execute is [pgbackrest stanza-create --db-host=10.42.5.228 --db-path=/pgdata/bpg2]"
time="2020-07-24T23:21:19Z" level=info msg="command is pgbackrest stanza-create --db-host=10.42.5.228 --db-path=/pgdata/bpg2 "
time="2020-07-24T23:21:19Z" level=error msg="command terminated with exit code 56"
time="2020-07-24T23:21:19Z" level=info msg="output=[]"
time="2020-07-24T23:21:19Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.5.228' terminated unexpectedly [255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
time="2020-07-24T23:21:19Z" level=error msg="command terminated with exit code 56"
Command from pgbackrest pod:
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ pgbackrest stanza-create --db-host=10.42.5.228 --db-path=/pgdata/bpg2
WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.5.228' terminated unexpectedly [255]: Authentication failed.
ERROR: [056]: unable to find primary cluster - cannot proceed
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 2022
Password:
Password:
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 222
Password:
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 22
Password:
^C
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh postgres@10.42.5.228 22
Authentication failed.
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh root@10.42.5.228 22
Password:
^C
kubectl -npgo describe service bpg2
Name: bpg2
Namespace: pgo
Labels: name=bpg2
pg-cluster=bpg2
vendor=crunchydata
Annotations: <none>
Selector: pg-cluster=bpg2,role=master
Type: ClusterIP
IP: 10.43.69.159
Port: sshd 2022/TCP
TargetPort: 2022/TCP
Endpoints: 10.42.5.228:2022
Port: postgres 5432/TCP
TargetPort: 5432/TCP
Endpoints: 10.42.5.228:5432
Session Affinity: None
Events: <none>
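One aside on the manual ssh tests above: without -p, ssh treats the trailing number as a remote command and connects on the default port 22, so a test against the pgBackRest sshd needs to look more like:
ssh -p 2022 10.42.5.228   # connection refused / no route = network problem; an auth failure at least means the port answers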
@jkatz this link is broken now (you posted it on 12 May): https://github.com/CrunchyData/postgres-operator/blob/master/deploy/pgo-scc.yaml Could you provide the new one, please?
As of Operator v4.3.1, the pgo-scc is no longer required if you're running in restricted mode:
https://github.com/CrunchyData/postgres-operator/releases/tag/v4.3.1
- Introduce DISABLE_FSGROUP option as part of the installation. When set to true, this does not add a FSGroup to the Pod Security Context when deploying PostgreSQL related containers or pgAdmin 4. This is helpful when deploying the PostgreSQL Operator in certain environments, such as OpenShift with a restricted Security Context Constraint. Defaults to false.
- Remove the custom Security Context Constraint (SCC) that would be deployed with the PostgreSQL Operator, so now the PostgreSQL Operator can be deployed using default OpenShift SCCs (e.g. "restricted", though note that DISABLE_FSGROUP will need to be set to true for that). The example PostgreSQL Operator SCC is left in the examples directory for reference.
https://github.com/CrunchyData/postgres-operator/blob/master/examples/pgo-scc.yaml
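If you use the installer, the release-note option above maps to a configuration value roughly like the following; the exact file and key name depend on your install method (Ansible inventory vs. the pgo-deployer values), so treat this as a sketch:
# Installer configuration value (sketch):
disable_fsgroup: "true"   # don't set fsGroup in the Pod Security Context (for restricted SCCs)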
@davi5e Could you please share the PSP you created? I tried to apply it to my operator but I cannot fix the issue. I'm also running Calico, but there are no NetworkPolicies in place.
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: Policy to allow PGO to function properly.
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  name: pgo-operator
spec:
  allowPrivilegeEscalation: true
  defaultAllowPrivilegeEscalation: true
  fsGroup:
    ranges:
    - max: 2
      min: 2
    - max: 26
      min: 26
    rule: MustRunAs
  requiredDropCapabilities:
  - KILL
  - MKNOD
  - SETGID
  - SETUID
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - configMap
  - downwardAPI
  - emptyDir
  - persistentVolumeClaim
  - projected
  - secret
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-pgo
rules:
- apiGroups:
  - extensions
  resourceNames:
  - pgo-operator
  resources:
  - podsecuritypolicies
  verbs:
  - use
Then you create a RoleBinding for serviceaccount/postgres-operator in PGO's namespace, and four in each target namespace to account for serviceaccount/pgo-backrest, serviceaccount/pgo-default, serviceaccount/pgo-pg, and serviceaccount/pgo-target.
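A sketch of one such RoleBinding, with illustrative name and namespace:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-pgo-backrest           # illustrative name
  namespace: cluster-ns            # one of the namespaces the operator watches
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-pgo
subjects:
- kind: ServiceAccount
  name: pgo-backrest
  namespace: cluster-ns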
Hope it helps.
Note: the code above is subject to change if pgo-scc.yaml ever changes... It's mainly a simple one-to-one conversion.
@davi5e If this is something that can be more generally helpful, we can find a way to include it in the documentation; my guess would be somewhere in the installation section, in a part about considerations for OpenShift?
...though I should add a word of caution here, as there is presently discussion that Pod Security Policies may be going away: https://www.antitree.com/2020/11/pod-security-policies-are-being-deprecated-in-kubernetes/
Right now I'm in shock at the repercussions of retooling our clusters to re-enable PSP behavior that I completely assumed I'd never have to think about again...
As far as documentation goes, Linkerd deploys their PSPs in every installation, the logic being that it makes no difference if Kubernetes doesn't have PSP admission enabled. As I've never used OpenShift, maybe an if statement that checks for Kubernetes in add-targeted-namespace.sh, a couple of JSON files, and those RoleBindings would suffice?
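Something along these lines is the kind of check I mean (purely hypothetical, not actual operator code; the file name is illustrative):
# Hypothetical addition to add-targeted-namespace.sh: only bind the PSP where SCCs don't exist.
if ! kubectl api-resources --api-group=security.openshift.io -o name 2>/dev/null \
    | grep -q securitycontextconstraints; then
  kubectl -n "$1" apply -f psp-rolebindings.yaml   # "$1" = the target namespace argument
fi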
To change the documentation: does this repo also host the documentation?
But again, if it's being removed...
I would keep an eye on how the PSP discussion evolves; given how fast things move in Kubernetes in general, it still seems up in the air.
We do keep the doc source code in this repository, in a folder conveniently called docs. I would not be opposed to keeping some information on using PSPs (or SCCs) in the documentation if it can be done in a way that's generally useful for these use cases. Knowing that security policies can vary widely between organizations, my goal would be something that's generically helpful.
Posting here in case anyone else runs into something similar:
I installed single-node k3s on a RHEL8 VM using Vagrant, then installed v4.5.1 of the operator.
I also saw this error message when installing a new cluster:
"stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.0.106' terminated unexpectedly [255]: ssh: connect to host 10.42.0.106 port 2022: No route to host\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
I tried your suggestions without joy:
But @jkatz, one of your posts about things working fine on GKE made me try a kind cluster, and it immediately worked there.
I then ran k3s check-config and it said something super interesting:
System:
- /usr/sbin iptables v1.8.4 (nf_tables): should be older than v1.8.0 or in legacy mode (fail)
I then remembered that RHEL8 uses a newer version of iptables that is not compatible with kube-proxy, so I suspect that's where my network failure came from and that it'd probably work on a RHEL7 machine. Anyway, I figured this would be good info to share with anyone else who sees this error, and something to add to your documentation: check whether your iptables is too new.
Having the same issue with pgo version 4.7.2.
pgo show user -n pgo gritview-postgres --show-system-accounts
CLUSTER USERNAME PASSWORD EXPIRES STATUS ERROR
------- -------- -------- ------- ------ -----------------------------------------------------------------------------
error primary pod not found for selector "pg-cluster=gritview-postgres,role=master"
kubectl -n pgo describe service gritview-postgres-backrest-shared-repo
Name: gritview-postgres-backrest-shared-repo
Namespace: pgo
Labels: name=gritview-postgres-backrest-shared-repo
pg-cluster=gritview-postgres
pgo-backrest-repo=true
vendor=crunchydata
Annotations: cloud.google.com/neg: {"ingress":true}
Selector: name=gritview-postgres-backrest-shared-repo
Type: ClusterIP
IP: 10.44.9.183
Port: <unset> 2022/TCP
TargetPort: 2022/TCP
Endpoints:
Session Affinity: None
Events: <none>
Seems like this image version was not being found: centos8-13.3-4.7.2
Describe the bug
It seems that the operator is not setting up roles and passwords for the pgbackrest and metrics containers.
To Reproduce
After waiting a bit, this is the list of pods:
These are the logs of the backup stanza,
At first, I was confused by the message unable to find primary cluster and I thought this might be down to some networking issue. However, on further inspection in the database container, I can see that the request to authenticate did reach the pod. These are the logs for /pgdata/hippo/pg_log/postgresql-Tue.log in the database container:
And the authentication errors happen in the metrics container too:
Expected behavior
Roles and passwords are set up correctly for the pgbackrest and metrics users that the other containers will use.
Please tell us about your environment: