CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Authentication issues on new cluster #1362

Closed: cortopy closed this issue 4 years ago

cortopy commented 4 years ago

Describe the bug

It seems that the operator is not setting up roles and passwords for the pgbackrest and metrics containers.

To Reproduce

  1. Create the cluster:
    pgo create cluster -n pgouser1 hippo

    After waiting a bit, this is the list of pods:

    + kubectl get pods --namespace pgouser1
    NAME                                          READY   STATUS    RESTARTS   AGE
    hippo-64c9968495-fpmp5                        2/2     Running   0          2m4s
    hippo-backrest-shared-repo-7c45c5d5f8-pzxcf   1/1     Running   0          2m4s
    hippo-stanza-create-d6rtz                     0/1     Error     0          87s

    These are the logs of the stanza-create job:

    + kubectl logs -f --namespace pgouser1 hippo-stanza-create-d6rtz
    time="2020-03-24T18:20:11Z" level=info msg="pgo-backrest starts"
    time="2020-03-24T18:20:11Z" level=info msg="debug flag set to false"
    time="2020-03-24T18:20:11Z" level=info msg="backrest stanza-create command requested"
    time="2020-03-24T18:20:11Z" level=info msg="command to execute is [pgbackrest stanza-create  --db-host=192.168.1.119 --db-path=/pgdata/hippo]"
    time="2020-03-24T18:20:11Z" level=info msg="command is pgbackrest stanza-create  --db-host=192.168.1.119 --db-path=/pgdata/hippo "
    time="2020-03-24T18:20:12Z" level=error msg="command terminated with exit code 56"
    time="2020-03-24T18:20:12Z" level=info msg="output=[]"
    time="2020-03-24T18:20:12Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '192.168.1.119' terminated unexpectedly [255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
    time="2020-03-24T18:20:12Z" level=error msg="command terminated with exit code 56"

At first, I was confused by the message "unable to find primary cluster" and thought this might be down to some networking issue. However, on further inspection in the database container, I can see that the authentication request did reach the pod.

These are the logs for /pgdata/hippo/pg_log/postgresql-Tue.log in the database container:

2020-03-24 18:19:48.594 UTC [186] LOG:  database system was shut down at 2020-03-24 18:19:46 UTC
2020-03-24 18:19:48.621 UTC [183] LOG:  database system is ready to accept connections
2020-03-24 18:19:49.156 UTC [194] FATAL:  password authentication failed for user "ccp_monitoring"
2020-03-24 18:19:49.156 UTC [194] DETAIL:  Role "ccp_monitoring" does not exist.
    Connection matched pg_hba.conf line 7: "host all all 0.0.0.0/0 md5"
2020-03-24 18:19:49.176 UTC [195] FATAL:  password authentication failed for user "ccp_monitoring"
2020-03-24 18:19:49.176 UTC [195] DETAIL:  Role "ccp_monitoring" does not exist.
    Connection matched pg_hba.conf line 7: "host all all 0.0.0.0/0 md5"
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:19:59.352 UTC [262] ERROR:  invalid input syntax for type json
2020-03-24 18:19:59.352 UTC [262] DETAIL:  The input string ended unexpectedly.
2020-03-24 18:19:59.352 UTC [262] CONTEXT:  JSON data, line 1: 
    COPY pgbackrest_info, line 1, column data: ""
    SQL statement "COPY monitor.pgbackrest_info (config_file, data) FROM program '/opt/cpm/bin/pgbackrest_info.sh' WITH (format text,DELIMITER '|')"
    PL/pgSQL function monitor.pgbackrest_info(integer) line 21 at SQL statement
2020-03-24 18:19:59.352 UTC [262] STATEMENT:  WITH all_backups AS ( SELECT config_file , jsonb_array_elements(data) AS stanza_data FROM monitor.pgbackrest_info(10) ) , per_stanza AS ( SELECT config_file , stanza_data->>'name' AS stanza , jsonb_array_elements(stanza_data->'backup') AS backup_data FROM all_backups ) SELECT config_file , stanza , backup_data->>'type' AS backup_type , EXTRACT( epoch FROM (max(to_timestamp((backup_data->'timestamp'->>'stop')::bigint))) - max(to_timestamp((backup_data->'timestamp'->>'start')::bigint)) ) AS backup_runtime_seconds FROM per_stanza GROUP BY config_file, stanza, backup_data->>'type'
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:00.635 UTC [262] ERROR:  invalid input syntax for type json
2020-03-24 18:20:00.635 UTC [262] DETAIL:  The input string ended unexpectedly.
2020-03-24 18:20:00.635 UTC [262] CONTEXT:  JSON data, line 1: 
    COPY pgbackrest_info, line 1, column data: ""
    SQL statement "COPY monitor.pgbackrest_info (config_file, data) FROM program '/opt/cpm/bin/pgbackrest_info.sh' WITH (format text,DELIMITER '|')"
    PL/pgSQL function monitor.pgbackrest_info(integer) line 21 at SQL statement
2020-03-24 18:20:00.635 UTC [262] STATEMENT:  WITH all_backups AS ( SELECT config_file , jsonb_array_elements(data) AS stanza_data FROM monitor.pgbackrest_info(10) ) , per_stanza AS ( SELECT config_file , stanza_data->>'name' AS stanza , jsonb_array_elements(stanza_data->'backup') AS backup_data FROM all_backups ) SELECT config_file , stanza , extract(epoch from (CURRENT_TIMESTAMP - max(to_timestamp((backup_data->'timestamp'->>'stop')::bigint)))) AS time_since_completion_seconds FROM per_stanza WHERE backup_data->>'type' IN ('full') GROUP BY config_file, stanza
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:02.832 UTC [262] ERROR:  invalid input syntax for type json
2020-03-24 18:20:02.832 UTC [262] DETAIL:  The input string ended unexpectedly.
2020-03-24 18:20:02.832 UTC [262] CONTEXT:  JSON data, line 1: 
    COPY pgbackrest_info, line 1, column data: ""
    SQL statement "COPY monitor.pgbackrest_info (config_file, data) FROM program '/opt/cpm/bin/pgbackrest_info.sh' WITH (format text,DELIMITER '|')"
    PL/pgSQL function monitor.pgbackrest_info(integer) line 21 at SQL statement
2020-03-24 18:20:02.832 UTC [262] STATEMENT:  WITH all_backups AS ( SELECT config_file , jsonb_array_elements(data) AS stanza_data FROM monitor.pgbackrest_info(10) ) , per_stanza AS ( SELECT config_file , stanza_data->>'name' AS stanza , jsonb_array_elements(stanza_data->'backup') AS backup_data FROM all_backups ) SELECT config_file , stanza , extract(epoch from (CURRENT_TIMESTAMP - max(to_timestamp((backup_data->'timestamp'->>'stop')::bigint)))) AS time_since_completion_seconds FROM per_stanza WHERE backup_data->>'type' IN ('full', 'diff', 'incr') GROUP BY config_file, stanza
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:04.227 UTC [262] ERROR:  invalid input syntax for type json
2020-03-24 18:20:04.227 UTC [262] DETAIL:  The input string ended unexpectedly.
2020-03-24 18:20:04.227 UTC [262] CONTEXT:  JSON data, line 1: 
    COPY pgbackrest_info, line 1, column data: ""
    SQL statement "COPY monitor.pgbackrest_info (config_file, data) FROM program '/opt/cpm/bin/pgbackrest_info.sh' WITH (format text,DELIMITER '|')"
    PL/pgSQL function monitor.pgbackrest_info(integer) line 21 at SQL statement
2020-03-24 18:20:04.227 UTC [262] STATEMENT:  WITH all_backups AS ( SELECT config_file , jsonb_array_elements(data) AS stanza_data FROM monitor.pgbackrest_info(10) ) , per_stanza AS ( SELECT config_file , stanza_data->>'name' AS stanza , jsonb_array_elements(stanza_data->'backup') AS backup_data FROM all_backups ) SELECT config_file , stanza , extract(epoch from (CURRENT_TIMESTAMP - max(to_timestamp((backup_data->'timestamp'->>'stop')::bigint)))) AS time_since_completion_seconds FROM per_stanza WHERE backup_data->>'type' IN ('full', 'diff') GROUP BY config_file, stanza
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:49.168 UTC [191] LOG:  archive command failed with exit code 125
2020-03-24 18:20:49.168 UTC [191] DETAIL:  The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:50.845 UTC [191] LOG:  archive command failed with exit code 125
2020-03-24 18:20:50.845 UTC [191] DETAIL:  The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:20:52.572 UTC [191] LOG:  archive command failed with exit code 125
2020-03-24 18:20:52.572 UTC [191] DETAIL:  The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
2020-03-24 18:20:52.573 UTC [191] WARNING:  archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.
2020-03-24 18:21:53.280 UTC [191] LOG:  archive command failed with exit code 125
2020-03-24 18:21:53.280 UTC [191] DETAIL:  The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001"
ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.

The authentication errors also appear in the metrics container:

+ kubectl logs -f --namespace pgouser1 hippo-64c9968495-fpmp5 -c collect
Tue Mar 24 18:19:45 UTC 2020 INFO: Setting credentials for collect PG user using file system
Tue Mar 24 18:19:45 UTC 2020 INFO: Waiting for PostgreSQL to be ready..
127.0.0.1:5432 - no response
127.0.0.1:5432 - no response
127.0.0.1:5432 - accepting connections
Tue Mar 24 18:19:49 UTC 2020 INFO: Checking if PostgreSQL is accepting queries..
psql: error: could not connect to server: FATAL:  password authentication failed for user "ccp_monitoring"
              now              
-------------------------------
 2020-03-24 18:19:51.193887+00
(1 row)

Tue Mar 24 18:19:51 UTC 2020 INFO: No custom queries detected. Applying default configuration..
Tue Mar 24 18:19:51 UTC 2020 INFO: Starting postgres-exporter..
time="2020-03-24T18:19:51Z" level=info msg="Established new database connection." source="postgres_exporter.go:1035"
time="2020-03-24T18:19:51Z" level=info msg="Semantic Version Changed: 0.0.0 -> 12.2.0" source="postgres_exporter.go:965"
time="2020-03-24T18:19:59Z" level=info msg="Error running query on database:  ccp_backrest_last_runtime pq: invalid input syntax for type json\n" source="postgres_exporter.go:933"
time="2020-03-24T18:20:00Z" level=info msg="Error running query on database:  ccp_backrest_last_full_backup pq: invalid input syntax for type json\n" source="postgres_exporter.go:933"
time="2020-03-24T18:20:02Z" level=info msg="Error running query on database:  ccp_backrest_last_incr_backup pq: invalid input syntax for type json\n" source="postgres_exporter.go:933"
time="2020-03-24T18:20:04Z" level=info msg="Error running query on database:  ccp_backrest_last_diff_backup pq: invalid input syntax for type json\n" source="postgres_exporter.go:933"
time="2020-03-24T18:20:04Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1178"

Expected behavior

Roles and passwords are set up correctly for the pgbackrest and metrics users that the other containers will use.

Please tell us about your environment:

jkatz commented 4 years ago

This looks like it's a networking issue.

2020-03-24 18:19:49.156 UTC [194] DETAIL: Role "ccp_monitoring" does not exist. Connection matched pg_hba.conf line 7: "host all all 0.0.0.0/0 md5"

Authentication is failing for the ccp_monitoring user because it did not initially exist. Once it is created, there subsequently appears to be a different error with loading some of its JSON. I recently tested this off of master and things appeared to load correctly. I'll check on REL_4_2 at some point.
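
If it helps to verify, you can check whether the role exists once the cluster settles. A quick sketch (pod and container names are taken from the listing above):

# List the ccp_monitoring role from inside the database container;
# an empty result means the operator has not created it (yet).
kubectl -n pgouser1 exec hippo-64c9968495-fpmp5 -c database -- psql -c '\du ccp_monitoring'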

For the other error:

[255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]

Authentication failed because it could not find the primary cluster.

Please try creating the cluster again.

cortopy commented 4 years ago

@jkatz thanks for the speedy response! I have tried a few times already, but with no success. Have you seen the logs for the database container? The request from pgbackrest does reach the pod, which is why I ruled out the networking possibility.

jkatz commented 4 years ago

I see this:

2020-03-24 18:21:53.280 UTC [191] DETAIL: The failed archive command was: source /tmp/pgbackrest_env.sh && pgbackrest archive-push "pg_wal/000000010000000000000001" ERROR: [125]: remote-0 process on 'hippo-backrest-shared-repo' terminated unexpectedly [255]: Authentication failed.

which is not pgBackRest reaching the PostgreSQL Pod, but rather PostgreSQL trying to execute the WAL archive command and not being able to reach the pgBackRest pod.

Does your environment block SSH by chance?
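
A quick way to check is to probe the sshd port directly from inside the pods. A sketch, using the pod and service names from the hippo example above; even an "Authentication failed." reply proves the port is reachable, while "connection refused" or "no route to host" points at the network:

# From the pgBackRest repo pod to the PostgreSQL service's sshd port...
kubectl -n pgouser1 exec hippo-backrest-shared-repo-7c45c5d5f8-pzxcf -- \
  ssh -p 2022 -o StrictHostKeyChecking=no hippo true
# ...and the reverse, from the database container to the repo service.
kubectl -n pgouser1 exec hippo-64c9968495-fpmp5 -c database -- \
  ssh -p 2022 -o StrictHostKeyChecking=no hippo-backrest-shared-repo true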

cortopy commented 4 years ago

I don't think so. This is a self-managed staging cluster with Cilium as the CNI, but I haven't configured any network policies for it.

jkatz commented 4 years ago

If possible, I would recommend disabling Cilium and then trying to deploy a cluster to see if it works.

jkatz commented 4 years ago

Having not heard back, I believe this is an issue with the Cilium configuration. If you are able to investigate more and report back, please let us know. In the interim I am closing this.

davi5e commented 4 years ago

Same issue, same error. Also 4.2.2.

I use Network Policies managed by Calico, but I have fully enabled ingress and egress traffic inside the cluster namespace. I also don't think this is a network problem; it seems to be something with SSH.

After reading https://github.com/pgbackrest/pgbackrest/issues/895, I'm under the impression that one of the containers expects a password for SSH and the lack of one is causing the problem.
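
For what it's worth, the operator is supposed to generate the SSH key material in a per-cluster secret that both pods mount, so no password should be involved. One way to eyeball it (a sketch; the -backrest-repo-config suffix follows the pgo v4 naming convention, so verify the exact name in your cluster):

# Find the per-cluster pgBackRest SSH/config secret and list the keys it carries.
kubectl -n pgouser1 get secrets | grep backrest-repo-config
kubectl -n pgouser1 describe secret hippo-backrest-repo-config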

davi5e commented 4 years ago

Below is what I'm doing. The stanza job crashes every time...

Logs:

time="2020-05-07T00:50:32Z" level=info msg="pgo-backrest starts"
time="2020-05-07T00:50:32Z" level=info msg="debug flag set to false"
time="2020-05-07T00:50:32Z" level=info msg="backrest stanza-create command requested"
time="2020-05-07T00:50:32Z" level=info msg="command to execute is [pgbackrest stanza-create  --db-host=10.3.0.59 --db-path=/pgdata/website-cluster]"
time="2020-05-07T00:50:32Z" level=info msg="command is pgbackrest stanza-create  --db-host=10.3.0.59 --db-path=/pgdata/website-cluster "
time="2020-05-07T00:50:34Z" level=error msg="command terminated with exit code 56"
time="2020-05-07T00:50:34Z" level=info msg="output=[]"
time="2020-05-07T00:50:34Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.3.0.59' terminated unexpectedly [255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
time="2020-05-07T00:50:34Z" level=error msg="command terminated with exit code 56"

If it helps, here is my pgo.yaml

Cluster:
  PrimaryNodeLabel: part-of=default
  ReplicaNodeLabel: part-of=default
  CCPImagePrefix: crunchydata
  Metrics: true
  Badger: false
  CCPImageTag: centos7-12.2-4.2.2
  Port: 5432
  PGBadgerPort: 10000
  ExporterPort: 9187
  User: postgres
  Database: website
  PasswordAgeDays: 
  PasswordLength: 24
  Policies: 
  Strategy: 1
  Replicas: 1
  ArchiveMode: false
  ArchiveTimeout: 60
  ServiceType: ClusterIP
  Backrest: true
  BackrestPort: 2022
  BackrestS3Bucket: 
  BackrestS3Endpoint: 
  BackrestS3Region: 
  DisableAutofail: false
  LogStatement: none
  LogMinDurationStatement: 60000
  PodAntiAffinity: preferred
  SyncReplication: true
PrimaryStorage: custom
BackupStorage: gce
ReplicaStorage: gce
BackrestStorage: gce
Storage:
  custom:
    AccessMode: ReadWriteOnce
    Fsgroup: 26
    MatchLabels: 
    Size: 25G
    StorageClass: persistent-pd
    StorageType: dynamic
    SupplementalGroups: 
  gce:
    AccessMode: ReadWriteOnce
    Size: 25G
    StorageType: dynamic
    StorageClass: standard
    Fsgroup: 26
DefaultContainerResources: custom
DefaultLoadResources: 
DefaultLspvcResource: 
DefaultRmdataResources: 
DefaultBackupResources: 
DefaultPgbouncerResources: 
ContainerResources:
  custom:
    RequestsMemory: 256Mi
    RequestsCPU: 0.1
    LimitsMemory: 1Gi
    LimitsCPU: 0.5
Pgo:
  PreferredFailoverNode: 
  Audit: false
  PGOImagePrefix: crunchydata
  PGOImageTag: centos7-4.2.2

I then issue these commands via a Bash script, edited for brevity:

# Create
pgo create cluster my-cluster --namespace=cluster-ns

# Wait for completion
kubectl -n cluster-ns rollout status -w deployment/my-cluster

# Enable autofail
echo yes | pgo update cluster my-cluster --enable-autofail --namespace=cluster-ns

# Enable backups
pgo create schedule my-cluster --schedule="0 4 * * *" --schedule-type=pgbackrest --pgbackrest-backup-type=full --schedule-opts="--repo1-retention-full=7" --namespace=cluster-ns

jkatz commented 4 years ago

The containers are using passwordless SSH keys. This sounds like an issue with your Calico configuration.

The telltale error is this:

[056]: unable to find primary cluster - cannot proceed

The reason for the "authentication failed" error is that the stanza-create job cannot connect to the pgBackRest repository.
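
If a restrictive policy set is in play, it would need to admit TCP 2022 between the pods. A minimal sketch (assumptions: the vendor=crunchydata label that the operator puts on both the PostgreSQL and repo pods, and pgouser1 standing in for the namespace):

kubectl -n pgouser1 apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pgbackrest-ssh
spec:
  # Select both the PostgreSQL and pgBackRest repo pods.
  podSelector:
    matchLabels:
      vendor: crunchydata
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 2022
EOF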

davi5e commented 4 years ago

Hello @jkatz!

OK, I'll debug the Network Policies.

In the meantime, the stanza Pod is trying to connect via port 2022 to the IP of the PostgreSQL master Pod, which has several containers inside.

As far as I can see, though, no container declares this port.

When I describe the pod, I get ports 5432 and 8009 (both TCP) for the database container, and port 9187 (TCP) for the collect container.

Maybe my configuration is missing something? Where would the SSH connection go?

jkatz commented 4 years ago

Port 2022 exists both for the pgBackRest repository to connect to a PostgreSQL pod and for the PostgreSQL pod to push archives to the pgBackRest repository. For example, in my cluster called hippo:

PostgreSQL

kubectl -n pgo describe service hippo

yields

Name:              hippo
Namespace:         pgo
Labels:            name=hippo
                   pg-cluster=hippo
                   vendor=crunchydata
Annotations:       <none>
Selector:          pg-cluster=hippo,role=master
Type:              ClusterIP
IP:                10.96.145.62
Port:              pgbadger  10000/TCP
TargetPort:        10000/TCP
Endpoints:         10.44.0.3:10000
Port:              postgres-exporter  9187/TCP
TargetPort:        9187/TCP
Endpoints:         10.44.0.3:9187
Port:              sshd  2022/TCP
TargetPort:        2022/TCP
Endpoints:         10.44.0.3:2022
Port:              patroni  8009/TCP
TargetPort:        8009/TCP
Endpoints:         10.44.0.3:8009
Port:              postgres  5432/TCP
TargetPort:        5432/TCP
Endpoints:         10.44.0.3:5432
Session Affinity:  None
Events:            <none>

pgBackRest Repository

kubectl -n pgo describe service hippo-backrest-shared-repo

yields

Name:              hippo-backrest-shared-repo
Namespace:         pgo
Labels:            name=hippo-backrest-shared-repo
                   pg-cluster=hippo
                   pgo-backrest-repo=true
                   vendor=crunchydata
Annotations:       <none>
Selector:          name=hippo-backrest-shared-repo
Type:              ClusterIP
IP:                10.96.168.67
Port:              <unset>  2022/TCP
TargetPort:        2022/TCP
Endpoints:         10.44.0.2:2022
Session Affinity:  None
Events:            <none>

davi5e commented 4 years ago

I just updated pgo to 4.3.0, but to no avail.

I can confirm that the backrest pod declares port 2022, but it's still absent from the postgresql pod...

Going over cluster-deployment.json, I see there is a volume for sshd, but no port.

@jkatz, could you send the descriptions of all the hippo pods? The services I create look like the ones you posted.

davi5e commented 4 years ago

After thinking about why the CNI causes the issue: maybe it just enforces that Pods declare each port they use?

If this assumption is correct, it would indeed be a network problem, since Calico would block traffic to a port it does not recognize. And since the stanza pod knows which IP to use and sends traffic directly to it, that would be the problem.

That's the only explanation I can think of. A further consequence would be that the other undeclared ports in the service wouldn't receive traffic either.
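
For what it's worth, upstream Kubernetes treats containerPort declarations as primarily informational, so an undeclared port is not blocked by Kubernetes itself; whether a CNI's policy engine cares is another matter. The API reference spells this out:

# The ContainerPort field docs note that not specifying a port here
# does NOT prevent that port from being exposed.
kubectl explain pod.spec.containers.ports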

davi5e commented 4 years ago

In GKE, I can confirm that without Network Policies or Pod Security Policies everything works...

I'll try to dig deeper into what's going on with Calico or if it's some PSP misconfiguration and post it here.

davi5e commented 4 years ago

@jkatz I found the source of the error. It's not related to Calico (although you are right that a correct network policy must be in place); it's related to the Pod Security Policy and the lack of some permissions for the backrest service account.

In my cluster, this is the PSP that gets assigned to the job. Can you help me figure out which option is missing? It should be related to the network, I guess...

apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    apparmor.security.beta.kubernetes.io/allowedProfileNames: runtime/default
    apparmor.security.beta.kubernetes.io/defaultProfileName: runtime/default
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: docker/default,runtime/default
    seccomp.security.alpha.kubernetes.io/defaultProfileName: runtime/default
  name: zz-default
spec:
  allowPrivilegeEscalation: false
  fsGroup:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  requiredDropCapabilities:
  - ALL
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    ranges:
    - max: 65535
      min: 1
    rule: MustRunAs
  volumes:
  - configMap
  - emptyDir
  - projected
  - secret
  - downwardAPI
  - persistentVolumeClaim

jkatz commented 4 years ago

@davi5e Try setting your fsGroup to 26:

  fsGroup:
    ranges:
    - max: 26
      min: 26

I'm not sure if you're using any supplemental groups; you may not need that section. But the "postgres" process expects the files to be owned by UID 26, hence the fsGroup setting.
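
To verify the effect from inside a running pod, a quick sketch (the pod name is from the hippo example earlier in this thread):

# With fsGroup applied, the data directory should show numeric group ID 26.
kubectl -n pgouser1 exec hippo-64c9968495-fpmp5 -c database -- ls -ldn /pgdata/hippo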

jkatz commented 4 years ago

This may also help: https://github.com/CrunchyData/postgres-operator/blob/master/deploy/pgo-scc.yaml

davi5e commented 4 years ago

Thank you @jkatz!

This file, after conversion from OpenShift to a PSP, really worked.

If anyone ever reads this: I changed the user to RunAsAny and created a RoleBinding for every service account PGO creates (one in the operator namespace, four in each observed namespace), all pointing to the modified PSP YAML above (spec here for v1.15). For me, this is secure enough.

Maybe this should go into the documentation somewhere? I read this link about security, but Pod Security Policies are absent...

jkatz commented 4 years ago

@davi5e Glad to hear it's working. We probably should augment the security documentation. Patches and contributions are certainly welcome :wink:

internetuser2008 commented 4 years ago

I am facing a similar issue with pgo 4.4.0. I can't seem to figure out what's being blocked; port 2022 seems to be open, but how do I confirm that?

pgo create cluster bpg2  -n pgo
kubectl get pods -npgo
NAME                                        READY   STATUS      RESTARTS   AGE
bpg2-backrest-shared-repo-6947dd5f4-xzj9m   1/1     Running     0          65m
bpg2-c7d459fc7-pmklb                        1/1     Running     1          64m
bpg2-stanza-create-2glzl                    0/1     Error       0          64m
bpg2-stanza-create-9zfzr                    0/1     Error       0          63m
bpg2-stanza-create-gbbrm                    0/1     Error       0          63m
bpg2-stanza-create-mk5ww                    0/1     Error       0          64m
bpg2-stanza-create-s4cbh                    0/1     Error       0          62m
pgo-deploy-p29sc                            0/1     Completed   0          5h37m
postgres-operator-5d5ff486c7-6pcjm          4/4     Running     0          5h36m

Job Log:

time="2020-07-24T23:21:19Z" level=info msg="pgo-backrest starts"
time="2020-07-24T23:21:19Z" level=info msg="debug flag set to false"
time="2020-07-24T23:21:19Z" level=info msg="backrest stanza-create command requested"
time="2020-07-24T23:21:19Z" level=info msg="command to execute is [pgbackrest stanza-create  --db-host=10.42.5.228 --db-path=/pgdata/bpg2]"
time="2020-07-24T23:21:19Z" level=info msg="command is pgbackrest stanza-create  --db-host=10.42.5.228 --db-path=/pgdata/bpg2 "
time="2020-07-24T23:21:19Z" level=error msg="command terminated with exit code 56"
time="2020-07-24T23:21:19Z" level=info msg="output=[]"
time="2020-07-24T23:21:19Z" level=info msg="stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.5.228' terminated unexpectedly [255]: Authentication failed.\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"
time="2020-07-24T23:21:19Z" level=error msg="command terminated with exit code 56"

Command from pgbackrest pod:

[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ pgbackrest stanza-create  --db-host=10.42.5.228 --db-path=/pgdata/bpg2
WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.5.228' terminated unexpectedly [255]: Authentication failed.
ERROR: [056]: unable to find primary cluster - cannot proceed
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 2022
Password: 
Password: 

[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 222 
Password: 

[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh 10.42.5.228 22 
Password: 
^C
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh postgres@10.42.5.228 22
Authentication failed.
[pgbackrest@bpg2-backrest-shared-repo-6947dd5f4-xzj9m /]$ ssh root@10.42.5.228 22
Password: 
^C

 kubectl -npgo describe service bpg2
Name:              bpg2
Namespace:         pgo
Labels:            name=bpg2
                   pg-cluster=bpg2
                   vendor=crunchydata
Annotations:       <none>
Selector:          pg-cluster=bpg2,role=master
Type:              ClusterIP
IP:                10.43.69.159
Port:              sshd  2022/TCP
TargetPort:        2022/TCP
Endpoints:         10.42.5.228:2022
Port:              postgres  5432/TCP
TargetPort:        5432/TCP
Endpoints:         10.42.5.228:5432
Session Affinity:  None
Events:            <none>
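
One note on the ssh tests above: a bare trailing argument is treated as the remote command, not the port, so ssh 10.42.5.228 2022 still dials port 22. To probe the sshd behind the service:

# -p selects the port; an "Authentication failed." reply here would still
# prove that port 2022 is reachable.
ssh -p 2022 10.42.5.228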

MarekBiolik commented 4 years ago

@jkatz this link expired (you posted it on 12 May): https://github.com/CrunchyData/postgres-operator/blob/master/deploy/pgo-scc.yaml Could you provide the new one please?

davi5e commented 4 years ago

https://github.com/CrunchyData/postgres-operator/blob/master/examples/pgo-scc.yaml

jkatz commented 4 years ago

As of Operator v4.3.1, the pgo-scc is no longer required if you're running in restricted mode:

https://github.com/CrunchyData/postgres-operator/releases/tag/v4.3.1

  • Introduce DISABLE_FSGROUP option as part of the installation. When set to true, this does not add a FSGroup to the Pod Security Context when deploying PostgreSQL related containers or pgAdmin 4. This is helpful when deploying the PostgreSQL Operator in certain environments, such as OpenShift with a restricted Security Context Constraint. Defaults to false.
  • Remove the custom Security Context Constraint (SCC) that would be deployed with the PostgreSQL Operator, so now the PostgreSQL Operator can be deployed using default OpenShift SCCs (e.g. "restricted", though note that DISABLE_FSGROUP will need to be set to true for that). The example PostgreSQL Operator SCC is left in the examples directory for reference.

ppodevlabs commented 3 years ago

https://github.com/CrunchyData/postgres-operator/blob/master/examples/pgo-scc.yaml

@davi5e Could you please share the PSP you created? I tried to apply it to my operator, but I cannot fix the issue. I'm also running Calico, but there are no NetworkPolicies in place.

davi5e commented 3 years ago

---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  annotations:
    kubernetes.io/description: Policy to allow PGO to function properly.
    seccomp.security.alpha.kubernetes.io/allowedProfileNames: '*'
  name: pgo-operator
spec:
  allowPrivilegeEscalation: true
  defaultAllowPrivilegeEscalation: true
  fsGroup:
    ranges:
    - max: 2
      min: 2
    - max: 26
      min: 26
    rule: MustRunAs
  requiredDropCapabilities:
  - KILL
  - MKNOD
  - SETGID
  - SETUID
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
  - configMap
  - downwardAPI
  - emptyDir
  - persistentVolumeClaim
  - projected
  - secret

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-pgo
rules:
- apiGroups:
  - extensions
  resourceNames:
  - pgo-operator
  resources:
  - podsecuritypolicies
  verbs:
  - use

Then you create a RoleBinding for serviceaccount/postgres-operator in PGO's namespace and four in each target namespace, to account for serviceaccount/pgo-backrest, serviceaccount/pgo-default, serviceaccount/pgo-pg, and serviceaccount/pgo-target.
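
A minimal sketch of one such binding (pgouser1 stands in for a target namespace; the names are illustrative):

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: psp-pgo-backrest
  namespace: pgouser1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-pgo
subjects:
- kind: ServiceAccount
  name: pgo-backrest
  namespace: pgouser1
EOF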

Hope it helps.

Note: the code above is subject to change if pgo-scc.yaml ever changes... It's mainly a simple one-to-one conversion.

jkatz commented 3 years ago

@davi5e If this is something that can be more generally helpful, we can find a way to include it in the documentation; my guess is somewhere in the installation section, in a part about considerations for OpenShift?

...though I may throw some caution to the wind here, as there is presently discussion that Pod Security Policies may be going away: https://www.antitree.com/2020/11/pod-security-policies-are-being-deprecated-in-kubernetes/

davi5e commented 3 years ago

Right now I'm in shock at the repercussions of retooling our clusters to re-enable the PSP behavior, which I had completely assumed I'd never have to think about again...

As far as documentation goes, Linkerd deploys its PSPs in every installation, the logic being that it makes no difference if Kubernetes doesn't have PSP enabled. As I have never used OpenShift, maybe an if statement that checks for Kubernetes in add-targeted-namespace.sh, a couple of JSON files, and those RoleBindings would suffice?

To change the documentation: does this repo host the documentation as well?

But again, if it's being removed...

jkatz commented 3 years ago

I would keep an eye on how the PSP discussion evolves -- given how fast things move in Kubernetes in general, it still seems up in the air.

We do keep the doc source code in this repository, in a folder conveniently called docs. I would not be opposed to keeping some information on using PSPs (or SCCs) in the documentation if it can be done in a way that's generally useful for these use cases. Knowing that security policies can vary greatly between organizations, my goal would be something that's generically helpful.

neoakris commented 3 years ago

Posting here in case anyone else runs into something similar:

I installed single-node k3s on a RHEL 8 VM using Vagrant, then installed v4.5.1 of the operator. I also saw this error message when installing a new cluster:

"stderr=[WARN: unable to check pg-1: [UnknownError] remote-0 process on '10.42.0.106' terminated unexpectedly [255]: ssh: connect to host 10.42.0.106 port 2022: No route to host\nERROR: [056]: unable to find primary cluster - cannot proceed\n]"

I tried your suggestions, without joy.

But @jkatz, one of your posts about it working fine on GKE made me try a kind cluster, and it immediately worked fine there.

I then ran k3s check-config and it said something super interesting:

System:

  • /usr/sbin iptables v1.8.4 (nf_tables): should be older than v1.8.0 or in legacy mode (fail)

I then remembered that RHEL 8 uses a newer version of iptables that is not compatible with kube-proxy, so I suspect that's where my network failure came from, and that it would probably work on a RHEL 7 machine. Anyway, I figured this would be good info to share with anyone else who sees this error, and something to add to your documentation: check whether your iptables is too new.
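
If anyone wants to confirm which backend they are on, something like this (a sketch; the update-alternatives switch applies to distros that ship both backends, e.g. Debian/Ubuntu, and RHEL 8 may need a different approach):

# iptables v1.8+ prints its backend in parentheses, e.g. "(nf_tables)" or "(legacy)".
iptables --version
# k3s reports the mismatch directly, as shown above.
k3s check-config
# Where both backends are installed, switch to legacy mode for kube-proxy:
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy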

Dnathan33 commented 3 years ago

Having the same issue with pgo version 4.7.2.

pgo show user -n pgo gritview-postgres --show-system-accounts     

CLUSTER USERNAME PASSWORD EXPIRES STATUS ERROR
------- -------- -------- ------- ------ -----------------------------------------------------------------------------
                                  error  primary pod not found for selector "pg-cluster=gritview-postgres,role=master"
kubectl -n pgo describe service gritview-postgres-backrest-shared-repo

Name:              gritview-postgres-backrest-shared-repo
Namespace:         pgo
Labels:            name=gritview-postgres-backrest-shared-repo
                   pg-cluster=gritview-postgres
                   pgo-backrest-repo=true
                   vendor=crunchydata
Annotations:       cloud.google.com/neg: {"ingress":true}
Selector:          name=gritview-postgres-backrest-shared-repo
Type:              ClusterIP
IP:                10.44.9.183
Port:              <unset>  2022/TCP
TargetPort:        2022/TCP
Endpoints:
Session Affinity:  None
Events:            <none>

Seems like this image version was not being found:

centos8-13.3-4.7.2
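
If the image tag really can't be pulled, the pod events will say so. A quick sketch (names follow the output above):

# Look for ErrImagePull / ImagePullBackOff on the cluster's pods.
kubectl -n pgo get pods -l pg-cluster=gritview-postgres
kubectl -n pgo get events --sort-by=.lastTimestamp | grep -iE 'pull|image'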