
[bitnami/postgresql-ha] can not install with helm #4623

Closed trungphungduc closed 3 years ago

trungphungduc commented 3 years ago

Which chart: bitnami/postgresql-ha (latest)

Describe the bug

... could not open relation mapping file "global/pg_filenode.map": No such file or directory
... could not create directory "pg_replslot/repmgr_slot_1001.tmp": No such file or directory

To Reproduce

helm install pgpool bitnami/postgresql-ha \
--set volumePermissions.enabled=true \
--set persistence.existingClaim=stgcc-pgpool2-pv-claim \

Expected behavior: the installation works.

Version of Helm and Kubernetes:

Additional context
I deploy on CentOS Linux 7 (Core). The error is caused by "--set volumePermissions.enabled=true":
When I set "--set volumePermissions.enabled=false" => everything works fine.
When I set "--set volumePermissions.enabled=true" => the error appears, as in the image below.

stgcc-pgpool2-pv-claim (PV and PVC manifests):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: stgcc-pgpool2-pv-volume
  labels:
    type: local
    app: stgcc-postgres2
spec:
  storageClassName: local
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: "/home/bitnami_postgresql/stage/cochat/persistenceKube/pgpool"


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stgcc-pgpool2-pv-claim
  labels:
    app: stgcc-postgres2
spec:
  storageClassName: local
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

Screen Shot 2020-12-03 at 15 35 45
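
For completeness, a quick way to confirm that the PV and PVC above are bound before installing the chart (resource names taken from the manifests above):

# Both should report STATUS "Bound"
kubectl get pv stgcc-pgpool2-pv-volume
kubectl get pvc stgcc-pgpool2-pv-claim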

marcosbc commented 3 years ago

Hi @trungphungduc, it is not always required to install with volumePermissions.enabled=false; it depends on your PV type.

Also, note that the flag is only meant to be used when deploying with a new PVC; existing ones should already have the correct permissions, so there is no need to specify it.

If you remove it, do you get any other errors? If not, I assume we can consider this solved, right?
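
For context, volumePermissions.enabled=true roughly makes the chart add an init container that chowns the data directory to the non-root user before PostgreSQL starts. A simplified sketch of the idea (not the chart's exact template; image, tag, and paths are illustrative):

initContainers:
  - name: volume-permissions
    image: docker.io/bitnami/bitnami-shell:latest  # illustrative image/tag
    command: ["/bin/sh", "-ec", "chown -R 1001:1001 /bitnami/postgresql"]
    securityContext:
      runAsUser: 0  # needs root to change ownership
    volumeMounts:
      - name: data
        mountPath: /bitnami/postgresql

On some volume types this chown can fail or be ignored, which is one reason the flag's effect depends on the PV type, as noted above.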

js02sixty commented 3 years ago

I have pretty much the same issue, whether or not I use volumePermissions. Data is being populated into the PV, using the default security context of 1001. Not sure if this has anything to do with it, but the log shows "starting monitoring of node ... (ID: 1000)". Should it instead be 1001?

trungphungduc commented 3 years ago

Hi @trungphungduc, it is not always required to install with volumePermissions.enabled=false; it depends on your PV type.

Also, note that the flag is only meant to be used when deploying with a new PVC; existing ones should already have the correct permissions, so there is no need to specify it.

If you remove it, do you get any other errors? If not, I assume we can consider this solved, right?

Hi @marcosbc, when I remove it, the error (as in the picture above) still appears. Here is the command I use.

helm install stgcc-pgpool-pers bitnami/postgresql-ha \
--set persistence.existingClaim=stgcc-pgpool-pv-claim

marcosbc commented 3 years ago

Sorry, I meant the other way around: It is not always required to install with volumePermissions.enabled=false.

According to the issue's description:

The error is caused by "--set volumePermissions.enabled=true":
When I set "--set volumePermissions.enabled=false" => everything works fine.
When I set "--set volumePermissions.enabled=true" => the error appears, as in the image below.

So if you set --set volumePermissions.enabled=false, it works, right?


@js02sixty Probably not, it was just a guess of mine. Could you share how you are populating data into the PV? Also, can you confirm the PostgreSQL data is located at /bitnami/postgresql/data inside the PV?
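
One way to check that from inside the cluster (pod name and namespace are placeholders):

kubectl exec -it <postgresql-repmgr-pod> -n <namespace> -- ls -la /bitnami/postgresql/data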

js02sixty commented 3 years ago

chart install settings:

helm install pg bitnami/postgresql-ha --version 6.2.0 -f - <<EOF
postgresql:
  password: JnT7n4lJDr8WKF4lyLS2VvsJ
  repmgrPassword: 0Er7CGvleQjPosJY1iEfkwNm
volumePermissions:
  enabled: true
persistence:
  existingClaim: pg-data
service:
  type: LoadBalancer
  loadBalancerIP: 172.24.16.111
EOF

PV Info:

kubectl get pv pvc-d09632ff-8a81-49d9-a5a0-0d0e6d51a1e2
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM             STORAGECLASS   REASON   AGE
pvc-d09632ff-8a81-49d9-a5a0-0d0e6d51a1e2   10Gi       RWX            Delete           Bound    default/pg-data   nfs-fast                24h

nfs-client provisioner-based PVC:

╭─root@vhost01 /srv/cluster/nfs02 
╰─# ls -lha default-pg-data-pvc-d09632ff-8a81-49d9-a5a0-0d0e6d51a1e2
total 4.0K
drwxrwxrwx. 5 1001 1001   42 Dec  8 11:05 .
drwxr-xr-x. 9 root root 4.0K Dec  7 10:53 ..
drwx------. 2 1001 1001   48 Dec  8 11:01 conf
drwx------. 9 1001 root  235 Dec  8 11:06 data
drwxr-xr-x. 2 1001 root   25 Dec  8 11:06 lock
╭─root@vhost01 /srv/cluster/nfs02 
╰─# ls -lha default-pg-data-pvc-d09632ff-8a81-49d9-a5a0-0d0e6d51a1e2/data 
total 20K
drwx------. 9 1001 root  235 Dec  8 11:06 .
drwxrwxrwx. 5 1001 1001   42 Dec  8 11:05 ..
drwx------. 6 1001 root   54 Dec  8 11:06 base
-rw-------. 1 1001 root   51 Dec  8 11:06 current_logfiles
-rw-------. 1 1001 root 1.6K Dec  8 11:06 pg_ident.conf
drwx------. 4 1001 root   68 Dec  8 11:06 pg_logical
drwx------. 2 1001 root    6 Dec  8 11:06 pg_replslot
drwx------. 2 1001 root   84 Dec  8 11:06 pg_stat
drwx------. 2 1001 root    6 Dec  8 11:06 pg_stat_tmp
drwx------. 2 1001 root    6 Dec  8 11:06 pg_tblspc
-rw-------. 1 1001 root    3 Dec  8 11:06 PG_VERSION
drwx------. 2 1001 root   18 Dec  8 11:06 pg_xact
-rw-------. 1 1001 root   88 Dec  8 11:06 postgresql.auto.conf
-rw-------. 1 1001 root  249 Dec  8 11:06 postmaster.opts

js02sixty commented 3 years ago

I now know what the problem is. Instead of selecting an existingClaim this time, I chose storageClass for dynamic provisioning. Each pod is now getting its own PV; before, they were sharing one, which was causing the conflict. I know that with chart mariadb-9.0.1 you can select an existingClaim for the first pod, and the subsequent replicated pods can be provisioned dynamically with "storageClass".
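
For illustration, with dynamic provisioning the StatefulSet creates one PVC per replica via its volume claim template, so a listing would look roughly like this (claim names are an assumption based on the usual <template>-<pod> naming, not copied from this deployment):

kubectl get pvc
# data-pg-postgresql-ha-postgresql-0   Bound   ...   10Gi   RWO   nfs-fast
# data-pg-postgresql-ha-postgresql-1   Bound   ...   10Gi   RWO   nfs-fast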

trungphungduc commented 3 years ago

@marcosbc As you said: "So if you set --set volumePermissions.enabled=false it would work right?" => it works OK.

But the data is not saved to the PVC (the one I use for existingClaim).

My goal is to persist the data.

trungphungduc commented 3 years ago

@marcosbc I just found a hint: if I add --set persistence.enabled=false => everything is OK; if --set persistence.enabled=true => error (like the image above).

So the full command I want, but which causes the problem => install ERROR:

helm install stgcc-pgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc-pgpool-pv-claim \
--set volumePermissions.enabled=true

The full command here => installs OK, but cannot mount persistent data:

helm install stgcc-pgpool bitnami/postgresql-ha \
--set persistence.enabled=false \
--set persistence.existingClaim=stgcc-pgpool-pv-claim \
--set volumePermissions.enabled=true

marcosbc commented 3 years ago

@trungphungduc I see, I thought your issues were specific to the volumePermissions flags, not the persistence ones.

In that case, could you share the full initialization logs from the K8s postgresql-repmgr pod? If you could show them with --set postgresqlImage.debug=true enabled, it would be great.

Also, could you check the file permissions inside your PostgreSQL volume folder /home/bitnami_postgresql/stage/cochat/persistenceKube/pgpool? Specifically, the ones in the bitnami/postgresql and bitnami/postgresql/data subdirectories. You can do that with ls -la /path/to/dir.

trungphungduc commented 3 years ago

@marcosbc Here are the 3 logs.

  1. First three pics: logs of pod/stgcc-pgpool-postgresql-ha-postgresql-0. Screen Shot 2020-12-10 at 10 32 34 Screen Shot 2020-12-10 at 10 32 54 Screen Shot 2020-12-10 at 10 33 04

  2. ls -la /home/bitnami_postgresql/stage/cochat/persistenceKube/pgpool: Screen Shot 2020-12-10 at 10 41 02

  3. I connect to the postgres container via docker exec -it sh, then run ls -la bitnami/postgresql and ls -la bitnami/postgresql/data: Screen Shot 2020-12-10 at 10 54 07 Screen Shot 2020-12-10 at 10 54 15

marcosbc commented 3 years ago

Hi @trungphungduc, I'm a bit confused. You were previously using stgcc-pgpool-pv-claim, which had the data mounted at /home/bitnami_postgresql/stage/cochat/persistenceKube/pgpool. But then I see that this directory is empty, while the directory mounted in the container (which should be empty) is not.

I think I'm missing something, maybe you were using a different volume?

In any case, from the container's POV I cannot see any misconfigured file permissions. Could you check that your PV has enough space for your DB? Maybe 1Gi is not enough for your case:

spec:
  storageClassName: local
  capacity:
    storage: 1Gi
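
If it helps, a couple of ways to check the available space (pod and claim names are placeholders; df is assumed to be available in the container image):

kubectl describe pvc <pvc-name>
kubectl exec -it <postgresql-repmgr-pod> -- df -h /bitnami/postgresql
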
js02sixty commented 3 years ago

@trungphungduc, I think the whole issue would be avoided if you chose not to use an existing claim, because when you do that, all of the replicas use the same PVC, which is no good: they will constantly write over each other until you get a crash loop. Each replica needs its own PVC, and the only way I can see that working is if you specify a storage class, like this:

helm install stgcc-pgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.storageClass=<some storage class> \
--set volumePermissions.enabled=true

The chart will deploy with an existing claim if you set postgresql.replicaCount=1.
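
For instance, a single-replica install that reuses an existing claim would look roughly like this (a sketch combining the flags used elsewhere in this thread, not a tested command):

helm install stgcc-pgpool bitnami/postgresql-ha \
--set postgresql.replicaCount=1 \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc-pgpool-pv-claim \
--set volumePermissions.enabled=true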

trungphungduc commented 3 years ago

Hi @marcosbc, let me explain my thinking, step by step, for installing postgresql-ha. My goal is to install postgresql-ha and keep my data safe on my real (physical) Linux server, so:

Step 1. I create a PV and PVC to hold the content => they are empty at this time.
Step 2. I install postgresql-ha with the intention of mounting data onto the PVC from step 1.

helm install stgcc-pgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc-pgpool-pv-claim \
--set volumePermissions.enabled=true

The questions are:

  1. Is my way of installing correct or not?
  2. If not, can you show me the right way?

Result: error, as in the pictures above.

Hi @js02sixty, thanks for your suggestion, I will try your method and get back to you soon. I just stepped into the Kubernetes world a few days ago, so my skills are not good yet; I need to find out what a storage class is and how to use it, so please give me some time.

marcosbc commented 3 years ago

@trungphungduc Those steps would be correct, yes. You would also probably need to specify --set volumePermissions.enabled=true (refer to our docs).

However, when you shared the screenshot, that directory was empty. After you deploy the chart (step 2), is it still empty?

If it was indeed empty, the other screenshot you shared (showing the contents of /bitnami/postgresql/data) would not make sense because it should show the contents of the PV (which is empty).

trungphungduc commented 3 years ago

Hi @marcosbc,

  1. /home/bitnami_postgresql/stage/cochat/persistenceKube/pgpool is empty. => This directory was indeed empty, because the postgresql-ha installation got an error, so it could not mount the data.
  2. About the screenshot (showing the contents of /bitnami/postgresql/data): => I try to connect as fast as possible to take a picture. After a few seconds, the pod dies and another pod is created. To put it simply, I cannot connect to the pod and cannot take a screenshot of /bitnami/postgresql/data.

marcosbc commented 3 years ago

Hi @trungphungduc,

I try to connect as fast as possible to take a picture. After a few seconds, the pod dies and another pod is created.

To avoid that, you can specify the following options to your deployment:

--set 'postgresql.command[0]=sleep,postgresql.command[1]=infinity,postgresql.readinessProbe.enabled=false,postgresql.livenessProbe.enabled=false'

Then, kubectl exec into your postgresql-repmgr container(s) and run the following command:

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh

It should fail with the same logs as before, but now you have access to the container, it does not die after a few seconds, and you can perform the proper diagnostics, i.e.:

$ ls -la /bitnami/postgresql
$ ls -la /bitnami/postgresql/data

If you could run those commands in the deployment with volumePermissions.enabled set to both true and false, I think it would be helpful to compare.

trungphungduc commented 3 years ago

Hi @marcosbc, I am a little busy nowadays due to COVID-19; I will take a picture for you ASAP. Please keep this post active. Thanks.

stale[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

trungphungduc commented 3 years ago

Hi @marcosbc, sorry for keeping you waiting, I just came back. I followed your instructions. Step 1:

helm install testpgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc16-pgpool-pv-claim \
--set volumePermissions.enabled=true \
--set postgresql.command[0]=sleep \
--set postgresql.command[1]=infinity \
--set postgresql.readinessProbe.enabled=false \
--set postgresql.livenessProbe.enabled=false \
--set postgresqlImage.debug=true \
--set postgresql.repmgrPassword=123456 \
-n stgcochat

Step 2: run /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh => Result: Screen Shot 2021-01-11 at 14 30 28

Screen Shot 2021-01-11 at 14 33 05

Screen Shot 2021-01-11 at 14 33 22

Step 3: checking my server (CentOS 7). Screen Shot 2021-01-11 at 14 35 41 Screen Shot 2021-01-11 at 14 36 12

=> logs of the "postgres master" pod, but no output comes out (I don't know why): Screen Shot 2021-01-11 at 14 36 54

=> logs of the "pgpool" pod: Screen Shot 2021-01-11 at 14 38 53

marcosbc commented 3 years ago

Hi @trungphungduc, this time it looks like your data is getting persisted properly.

As for your error, it seems as if you stopped (i.e. exited the shell of) the PostgreSQL container before the PgPool container was restarted, so it cannot find any running PostgreSQL.

I would advise keeping the PostgreSQL/repmgr service running (not quitting it) and giving the PgPool container some time to show the actual error. Note that you should also start the 2nd PostgreSQL node manually in a separate window. Make sure not to quit any of those before PgPool tries to connect to them.
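
In practice that would look something like the following, each command kept running in its own terminal (a sketch; pod names as they appear later in this thread):

# Terminal 1: first PostgreSQL node
kubectl exec -it pod/testpgpool-postgresql-ha-postgresql-0 -n stgcochat -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
# Terminal 2: second PostgreSQL node
kubectl exec -it pod/testpgpool-postgresql-ha-postgresql-1 -n stgcochat -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
# Keep both running while PgPool retries its connections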

trungphungduc commented 3 years ago

Hi @marcosbc, I am not stopping anything. Now I run the command as:

helm install testpgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc16-pgpool-pv-claim \
--set volumePermissions.enabled=true \
--set postgresqlImage.debug=true \
--set postgresql.repmgrPassword=123456 \
-n stgcochat

Result:

  1. kubectl exec into the pod and run the script. Screen Shot 2021-01-13 at 09 23 25

  2. Checking my server (CentOS 7). Get all: Screen Shot 2021-01-13 at 09 33 03 Log postgres-0: Screen Shot 2021-01-13 at 09 33 37 Log postgres-1: Screen Shot 2021-01-13 at 09 34 20 ==> I still get the error.

trungphungduc commented 3 years ago

After a few minutes, I check the logs of postgres-0: Screen Shot 2021-01-13 at 09 36 16 The error is the same one I always get.

marcosbc commented 3 years ago

Hi @trungphungduc, note that if you deploy with these options:

--set 'postgresql.command[0]=sleep,postgresql.command[1]=infinity'

Then "kubectl logs" will not show anything. So the first screenshot you sent, showing the error that a service is already running on port 5432, would make sense, because you would be trying to start PostgreSQL on a node where the service is already running.

Could you re-deploy in a clean environment, with the previous options (postgresql.command/args/readinessProbe/livenessProbe), and run the command? I would also specify those same options for pgpool (pgpool.command/args/...).

Unfortunately, you would need to kubectl exec into all PostgreSQL nodes to execute the specific run.sh command.
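
For reference, that is the same command shown earlier in this thread (the PgPool path below is an assumption about the image layout, not verified here):

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh
# PgPool equivalent (assumed path): /opt/bitnami/scripts/pgpool/entrypoint.sh /opt/bitnami/scripts/pgpool/run.sh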

trungphungduc commented 3 years ago

Hi @marcosbc, so I re-deployed:

helm install testpgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc16-pgpool-pv-claim \
--set volumePermissions.enabled=true \
--set postgresql.command[0]=sleep \
--set postgresql.command[1]=infinity \
--set postgresql.readinessProbe.enabled=false \
--set postgresql.livenessProbe.enabled=false \
--set postgresqlImage.debug=true \
--set postgresql.repmgrPassword=123456 \
-n stgcochat

Step 1: kubectl exec pod/testpgpool-postgresql-ha-postgresql-0 -n stgcochat Screen Shot 2021-01-14 at 11 03 46

Step 2: kubectl exec pod/testpgpool-postgresql-ha-postgresql-1 -n stgcochat Screen Shot 2021-01-14 at 11 05 23

Step 3: kubectl exec -it pod/testpgpool-postgresql-ha-pgpool-86854685-t24xn bash -n stgcochat Screen Shot 2021-01-14 at 11 06 45

Step 4: a few minutes later, the pgpool pod goes into CrashLoopBackOff => at this point I can no longer kubectl exec into the pgpool pod, because the container is not found ... kubectl get all -n stgcochat Screen Shot 2021-01-14 at 11 09 13

Step 5: pgpool logs: kubectl logs pod/testpgpool-postgresql-ha-pgpool-86854685-t24xn -n stgcochat Screen Shot 2021-01-14 at 11 26 20

marcosbc commented 3 years ago

Hi, it seems like you didn't specify pgpool.command/args/etc, so when you deploy PgPool it fails and ends up in CrashLoopBackOff. Please specify those and try again.
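
As a side note, even when a pod is in CrashLoopBackOff you can usually still fetch the logs of the crashed container with the --previous flag, e.g.:

kubectl logs pod/testpgpool-postgresql-ha-pgpool-86854685-t24xn -n stgcochat --previous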

Another thing I can see (in step 2) is that the postgresql-1 node already contains data, so the initialization fails. However, it seems the postgresql-0 node does not present this.

trungphungduc commented 3 years ago

Hi @marcosbc, I run the command below; is it right?

helm install testpgpool bitnami/postgresql-ha \
--set persistence.enabled=true \
--set persistence.existingClaim=stgcc16-pgpool-pv-claim \
--set volumePermissions.enabled=true \
--set postgresql.command[0]=sleep \
--set postgresql.command[1]=infinity \
--set postgresql.readinessProbe.enabled=false \
--set postgresql.livenessProbe.enabled=false \
--set postgresqlImage.debug=true \
--set postgresql.repmgrPassword=123456 \
--set pgpool.command[0]=sleep \
--set pgpool.command[1]=infinity \
--set pgpool.readinessProbe.enabled=false \
--set pgpool.livenessProbe.enabled=false \
-n stgcochat

If yes, then here is more information. I kubectl exec into the postgres/pgpool pods and run commands like:

kubectl exec -it pod/testpgpool-postgresql-ha-postgresql-0 -n stgcochat bash
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh /opt/bitnami/scripts/postgresql-repmgr/run.sh

Step 1: postgres-0 Screen Shot 2021-01-15 at 09 06 54

Step 2: postgres-1 Screen Shot 2021-01-15 at 09 07 37

Step 3: pgpool Screen Shot 2021-01-15 at 09 08 50


Step 4: checking my server (CentOS 7): Screen Shot 2021-01-15 at 09 05 32

What should I do next? Can you show me a more concrete example to run? Thanks in advance.

marcosbc commented 3 years ago

Hi @trungphungduc, I actually just noticed something that may explain all your issues (sorry for not noticing earlier).

In StatefulSets, using existingClaim should be avoided because the PVC will be shared by all replicas (in your case, replicas=2). Unfortunately, there is no simple fix for this.

That explains why, in your case, postgresql-1 has data before it is initialized (even in a new deployment): it is actually the data from postgresql-0.

Therefore, the only thing I can recommend is not to rely on existingClaim unless you do not add any replicas (i.e. replicas=1).
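
For background, when persistence.existingClaim is not set, the chart's StatefulSet relies on a volume claim template, so each replica gets its own PVC instead of sharing one. A simplified sketch of that mechanism (not the chart's actual template; names and sizes are illustrative):

apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: 2
  # ...
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: <your-storage-class>
        resources:
          requests:
            storage: 8Gi
# Kubernetes derives one PVC per pod from this template (data-<pod-0>, data-<pod-1>),
# which is why dynamic provisioning avoids the shared-claim conflict described above.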

trungphungduc commented 3 years ago

Hi @marcosbc, thanks for your suggestion. I will test it and get back to you soon.

erpel commented 3 years ago

I just ran into the same issue while reading through the options. I suggest to either:

If any option is preferred, I'd be happy to contribute a PR for it.

javsalgar commented 3 years ago

Hi,

Thanks for your input. In our experience, we've seen several special use cases with regard to existingClaim, so in principle I wouldn't force replicas=1. However, I believe that adding a note to the documentation is something that could help several users. We appreciate that you want to contribute a PR, so feel free to open it and we will take a look :D

stale[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 3 years ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 3 years ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.