Closed: @Richard87 closed this issue 2 years ago.
@Richard87 could you provide the specific steps, commands, etc. that you used to create the cluster in question? It appears you may have been attempting a clone across namespaces, and the errors may be related to an initial failure of the restore process.
Hi @tjmoore4! Yes, we are cloning across namespaces, and that works great! (But maybe it shouldn't, for security purposes?)
To create a cluster, we apply this YAML, often many times, to reset the staging cluster as a clone of production:
```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: eportaldb
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.35-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    io.kompose.service: eportal
                topologyKey: kubernetes.io/hostname
              weight: 100
      name: main
      resources:
        limits:
          memory: 500Mi
          cpu: 500m
        requests:
          memory: 500Mi
          cpu: 500m
      replicas: 1
  postgresVersion: 14
  dataSource:
    postgresCluster:
      repoName: repo1
      clusterName: eportaldb
      clusterNamespace: eportal
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.16-0
      port: 5432
      replicas: 1
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: eportaldb
                    postgres-operator.crunchydata.com/role: pgbouncer
                topologyKey: kubernetes.io/hostname
              weight: 1
```
Hey @Richard87, it sounds like you are leaving the `dataSource` in your spec after you complete the clone. Is this correct? We recommend removing that section from your spec after the clone has completed. If you are leaving it in, try to perform a clone, remove or comment out the `dataSource` section, then try to replicate this issue. Make sure to check that the cluster is in a healthy state between steps. Please try this and let us know if you continue to run into this issue.
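To illustrate that recommendation, here is a sketch using the field names from the cluster definition in this thread: once the clone has finished, the `dataSource` block can be removed or commented out before applying further changes.

```yaml
spec:
  # After the clone completes, remove or comment out this block so that
  # later reconciles no longer reference the finished restore Job:
  # dataSource:
  #   postgresCluster:
  #     repoName: repo1
  #     clusterName: eportaldb
  #     clusterNamespace: eportal
  postgresVersion: 14
  # ... rest of the spec unchanged ...
```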
Thanks, something seems to have changed, and the operator does trigger the requested changes now (even with the data source in).
I managed to upgrade the staging cluster to 14.1, and scheduled backups seem to work as expected as well!
I will run some tests on the production cluster next weekend and see if it works just as well!
Sounds good! I'll go ahead and close this issue but feel free to re-open if you run into any issues.
Hi @jmckulk, I had the same error again today and wonder if this issue should be reopened.
When starting to update the cluster to v5.1 (changing the image versions in use), the reconcile failed with the same error as before. I also had the same `dataSource` active in the YAML, but I don't think that was the reason for the failure.
I think it was because the cluster didn't have any backups (no schedule set up for repo1), and it therefore failed to reconcile the updated image with the error `unable to find instance name for pgBackRest restore Job`.
My solution was to delete the cluster, and re-create a new one with the same name (this time with a backup schedule configured!).
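For anyone hitting the same thing, a backup schedule is declared directly on the repo. A sketch based on the `schedules` field of the pgBackRest repo spec; the cron expressions here are placeholders, not values from this thread:

```yaml
spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"           # weekly full backup (standard cron syntax)
            differential: "0 1 * * 1-6" # daily differential on the other days
          # volume: ... volumeClaimSpec as before ...
```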
So, I think the operator should create a one-off backup if the error occurs, and use that newly created backup to continue setting up the new cluster.
Also, if possible, that failure should not stop other changes from being completed, but it might be incredibly complex to cover all cases!
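In the meantime, a one-off backup can be requested manually instead of deleting the cluster. A sketch using the operator's `manual` backup field (field names per the PGO v5 docs; verify against your operator version):

```yaml
spec:
  backups:
    pgbackrest:
      manual:
        repoName: repo1
        options:
          - --type=full
```

The backup is then triggered by annotating the cluster, e.g. `kubectl annotate postgrescluster eportaldb postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"`.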
Overview
I changed the image from
registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
to registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.1-0
in a test cluster, but nothing changed. Since it was a single-instance DB (no replicas), I changed replicas to 2, hoping it would trigger a rolling update, but nothing changed here either, and no secondary instance was brought up.
EDIT: I think this line from the operator log is the most telling, but I don't know what to do about it:
/EDIT
Environment
GKE
EXPECTED
ACTUAL
Logs
Postgres:
Postgres 2:
PGO:
Additional Information
The cluster definition: