
CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance
https://cloudnative-pg.io
Apache License 2.0

[Bug]: PGData already exists, can't overwrite for scale down and scale up #3792

Open · Hashdhi opened 8 months ago

Hashdhi commented 8 months ago

Is there an existing issue already for this bug?

I have read the troubleshooting guide

I am running a supported version of CloudNativePG

Contact Details

selvarajchennappan@gmail.com

Version

1.22.0

What version of Kubernetes are you using?

1.28

What is your Kubernetes environment?

Self-managed: kind (evaluation)

How did you install the operator?

YAML manifest

What happened?

How do we launch a cluster on existing PVCs? We have a scenario where we scale down and then scale back up: the cluster initially had 3 replicas, we scaled down to 1, then scaled back up by setting instances: 3 and applying the YAML. initdb failed. At the time of scale up it reports:

{"level":"info","ts":"2024-02-03T06:08:14Z","msg":"PGData already exists, can't overwrite","logging_pod":"srims-prod-1-initdb"}
Error: PGData directories already exist
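For reference, the scale operations described above boil down to changing spec.instances and re-applying the manifest; a minimal sketch, assuming the Cluster manifest below is saved as cluster.yaml (the file name is an assumption):

# scale down from 3 to 1: edit spec.instances to 1 in cluster.yaml, then
kubectl apply -f cluster.yaml
# scale back up: edit spec.instances back to 3, then
kubectl apply -f cluster.yaml
# per the report, bootstrap of the re-created instances then fails with:
#   "PGData already exists, can't overwrite"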

Cluster resource

apiVersion: v1
kind: Namespace
metadata:
  name: postgres
---
apiVersion: v1
data:
  username: cG9zdGdyZXM=
  password: cGFzc3dvcmQ=
kind: Secret
metadata:
  name: cluster-example-superuser
  namespace: postgres
type: kubernetes.io/basic-auth
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: abc-prod
  namespace: postgres
spec:
  description: "Cluster for srims"
  # Choose your PostgreSQL database version
  imageName: ghcr.io/cloudnative-pg/postgresql:16.1
  # Number of Replicas
  instances: 3

  minSyncReplicas: 1
  maxSyncReplicas: 1
  startDelay: 100
  stopDelay: 100
  replicationSlots:
    highAvailability:
      enabled: true
    updateInterval: 300
  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      shared_buffers: 256MB
      pg_stat_statements.max: '10000'
      pg_stat_statements.track: all
      auto_explain.log_min_duration: '10s'
      wal_keep_size: '512MB'
      pgaudit.log: "all, -misc"
      logging_collector: "on"
      log_destination: csvlog
      log_directory: /controller/log
      pgaudit.log_catalog: "off"
      pgaudit.log_parameter: "on"
      pgaudit.log_relation: "on"
    pg_hba:
      - host app app all password
    enableAlterSystem: true

  enableSuperuserAccess: true
  superuserSecret:
    name: cluster-example-superuser
  logLevel: debug

  storage:
    # size: 100Gi
    pvcTemplate:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 75Gi
      storageClassName: local-storage
      # volumeName: postgresnopg-local-storage-pv
      volumeMode: Filesystem

  resources: # m5large / m5xlarge: 2 vCPU, 8 Gi RAM
    requests:
      memory: "512Mi"
      cpu: "3"
    limits:
      memory: "1Gi"
      cpu: "4"

  affinity:
    enablePodAntiAffinity: true
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tools
            operator: In
            values:
            - common

  nodeMaintenanceWindow:
    inProgress: false
    reusePVC: true

  monitoring:
    enablePodMonitor: false

Relevant log output

No response

Code of Conduct

jlimai commented 7 months ago

We have the same issue. This bug is very critical, and not just for scale up or down: when we need to redeploy the Postgres cluster after a failure due to a node issue, we can't reschedule the pod onto another node with the existing PVC because it keeps failing with the "PGData already exists" error.

dperetti commented 7 months ago

I'm facing exactly this while evaluating CloudNativePG. It looks like a showstopper for now 😕. It can easily be replicated (see the sketch below).
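A quick variant of the same reproduction using kubectl patch instead of re-applying the file (resource name, cluster name, and namespace taken from the manifest above):

kubectl -n postgres patch clusters.postgresql.cnpg.io abc-prod --type merge -p '{"spec":{"instances":1}}'
# wait for the cluster to settle, then scale back up
kubectl -n postgres patch clusters.postgresql.cnpg.io abc-prod --type merge -p '{"spec":{"instances":3}}'
# per the issue, the re-created instances then fail with "PGData already exists, can't overwrite"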

PrivatePuffin commented 6 months ago

We have the same issue. This bug is very critical, and not just for scale up or down: when we need to redeploy the Postgres cluster after a failure due to a node issue, we can't reschedule the pod onto another node with the existing PVC because it keeps failing with the "PGData already exists" error.

@gbartolini This really is an issue: why is the same Cluster object, when restored, no longer able to consume its PVCs? It looks like CNPG sets some sort of flag on the Cluster object to indicate whether it needs to init or not.

This is not wanted behavior, at least not while it is not overridable, as it makes infrastructure as code a mess.
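One way to check that speculation is to inspect what the operator records on the Cluster object itself; a minimal sketch using only generic kubectl (names taken from the manifest above):

# dump the whole object, including .status, and look for bootstrap/init-related fields
kubectl -n postgres get clusters.postgresql.cnpg.io abc-prod -o yaml
# or just the phase reported by the operator
kubectl -n postgres get clusters.postgresql.cnpg.io abc-prod -o jsonpath='{.status.phase}'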


Another example: we need to reinstall/move some PVCs, and on that specific platform it's easier to just reinstall the Helm chart and move the old PVC data to the new PVCs.

That works fine with literally every piece of software except CNPG.


What would easily solve all of these issues is an option like initdb.useExisting: true.

That would skip the initdb steps when an existing PGDATA folder is found and instead try to use the database in that folder. This should work 100% of the time without any negative consequences.

You could even set it to false by default, to ensure it doesn't cause any issues for existing users.
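For illustration only, a sketch of what the proposed knob could look like in a Cluster manifest; useExisting does not exist in CloudNativePG today, and placing it under bootstrap.initdb is purely an assumption about where such a field might live:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: abc-prod
  namespace: postgres
spec:
  instances: 3
  bootstrap:
    initdb:
      # hypothetical field proposed in this thread, not implemented:
      # adopt an existing PGDATA instead of failing in initdb
      useExisting: true
  storage:
    pvcTemplate:
      storageClassName: local-storage
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 75Gi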

josephlim75 commented 2 months ago

Is there any progress on this issue?

josephlim75 commented 2 months ago

@gbartolini or @Hashdhi, has development on this feature started, or is it on hold? @dperetti, did you manage to work around this issue while waiting for a permanent solution?

sherlant commented 2 weeks ago

Hi all, I have the same issue. Is there no way to avoid initdb when provisioning a new cluster with an existing PV and existing PGDATA?