[Closed] sonawaneshivraj closed this issue 10 months ago
Hi,
Looking at the container code, I see the following in the `repmgr_generate_repmgr_config` function:

```bash
conninfo='user=${REPMGR_USERNAME} $(repmgr_get_conninfo_password) host=${REPMGR_NODE_NETWORK_NAME} dbname=${REPMGR_DATABASE} port=${REPMGR_PORT_NUMBER} connect_timeout=${REPMGR_CONNECT_TIMEOUT}'
```
This is being consumed by the StatefulSet:

```yaml
- name: REPMGR_CONNECT_TIMEOUT
  value: "30"
- name: REPMGR_RECONNECT_ATTEMPTS
  value: "5"
- name: REPMGR_RECONNECT_INTERVAL
  value: "5"
```
And this is set via the `postgresql.repmgrConnectTimeout` value.
In the case of the `repmgr_clone_primary` function, we may need to check whether the connection info (including the timeout) has to be passed explicitly as a parameter in order to be consumed. Could you confirm that it indeed took two seconds before crashing?
We tested with `REPMGR_CONNECT_TIMEOUT`, `REPMGR_RECONNECT_ATTEMPTS`, and `REPMGR_RECONNECT_INTERVAL`; once set, they are reflected successfully in `repmgr.conf`.
But these configurations are not picked up for the connection attempt in question. The debug logs print the connection info details, and there is no `connect_timeout` here:
```
INFO: connecting to source node
DETAIL: connection string is: host=metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local port=5432 user=repmgr dbname=repmgr
ERROR: connection to database failed
```
The log entry above comes from https://github.com/EnterpriseDB/repmgr/blob/a90d1cf3dd8da7eb036a86c42dca99bf46741784/repmgr-action-standby.c#L5766, which prints the complete connection string; there is no `connect_timeout` in it.
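In other words, unless the conninfo string handed to repmgr carries an explicit `connect_timeout`, repmgr's own default kicks in. A standalone sketch (not chart code; the default of 30 is an assumption) of guaranteeing the keyword is present in a conninfo string:

```shell
# Sketch only: append an explicit connect_timeout to a conninfo string when
# one is missing, so the connection does not fall back to repmgr's default.
ensure_connect_timeout() {
  local conninfo="$1" default="${2:-30}"
  if [[ "$conninfo" != *connect_timeout=* ]]; then
    conninfo="${conninfo} connect_timeout=${default}"
  fi
  printf '%s\n' "$conninfo"
}

ensure_connect_timeout "host=pg-0 port=5432 user=repmgr dbname=repmgr"
# → host=pg-0 port=5432 user=repmgr dbname=repmgr connect_timeout=30
```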
Hi,
So, instead of using the `-h` and `-p` flags in the clone command, what if we used the following?

```
-d, --dbname=conninfo    conninfo of the upstream node to use for cloning.
```

That should be enough to include the timeout, is that correct? Could you try launching the chart with `diagnosticMode.enabled=true` and manually running the clone command with the `--dbname` flag, and see whether the configuration gets consumed?
From the `repmgr_clone_primary` code, it seems `-d` is already being used to set the dbname:

```bash
repmgr_clone_primary() {
    info "Cloning data from primary node..."
    local -r flags=("-f" "$REPMGR_CONF_FILE" "-h" "$REPMGR_CURRENT_PRIMARY_HOST" "-p" "$REPMGR_CURRENT_PRIMARY_PORT" "-U" "$REPMGR_USERNAME" "-d" "$REPMGR_DATABASE" "-D" "$POSTGRESQL_DATA_DIR" "standby" "clone" "--fast-checkpoint" "--force")

    PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
}
```
But as you point out, it does seem like `-d` can be used to provide the complete conninfo and not just the dbname: https://repmgr.org/docs/4.2/repmgr-standby-clone.html
Will check further on this, try it out, and get back.
Also, https://repmgr.org/docs/4.2/cloning-cascading.html demonstrates the `-f` option to specify the conf file to use, but I do not see it documented anywhere; it may be worth a try as well.
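For reference, libpq decides whether a dbname argument is a plain database name or a full connection string by looking for an `=` (or a connection-URI prefix). A rough sketch of that rule of thumb, which is why extra keywords like `connect_timeout` can ride along in `-d`:

```shell
# Rough sketch of libpq's rule (see PQconnectdb docs): a dbname argument
# containing '=' or starting with postgresql:// / postgres:// is expanded
# as a connection string rather than treated as a database name.
is_conninfo() {
  local dbarg="$1"
  [[ "$dbarg" == *=* || "$dbarg" == postgresql://* || "$dbarg" == postgres://* ]]
}

is_conninfo "repmgr" && echo "conninfo" || echo "plain dbname"
# → plain dbname
is_conninfo "dbname=repmgr connect_timeout=30" && echo "conninfo" || echo "plain dbname"
# → conninfo
```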
I tried passing a `conninfo` string in the `-d` param; it seems to work and picks up the `connect_timeout`:

```console
$ repmgr -h metrics-pg-postgresql-0.metrics-pg-postgresql-headless.nsxdev-dgsinstances-hhi4a0zb.svc.cluster.local -U repmgr -d "dbname=repmgr connect_timeout=30 password=$REPMGR_PASSWORD" -D /data --force standby clone --dry-run
NOTICE: destination directory "/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=metrics-pg-postgresql-0.metrics-pg-postgresql-headless.nsxdev-dgsinstances-hhi4a0zb.svc.cluster.local user=repmgr password=Kts65So3TQQxkNapa2jQPwK1w9uF3PJt connect_timeout=30 dbname=repmgr
DETAIL: current installation size is 198 MB
INFO: "repmgr" extension is installed in database "repmgr"
```
Hi @sonawaneshivraj,
As I understand it, implementing the conninfo string in the `repmgr_clone_primary()` function could solve the issue:
```bash
repmgr_clone_primary() {
    # Clears WAL directory if existing (pg_basebackup requires the WAL dir to be empty)
    local -r waldir=$(postgresql_get_waldir)
    if [[ -d "$waldir" ]]; then
        info "Deleting existing WAL directory $waldir..."
        rm -rf "$waldir" && ensure_dir_exists "$waldir"
    fi
    info "Cloning data from primary node..."
    local -r flags=("-f" "$REPMGR_CONF_FILE" "-h" "$REPMGR_CURRENT_PRIMARY_HOST" "-p" "$REPMGR_CURRENT_PRIMARY_PORT" "-U" "$REPMGR_USERNAME" "-d" "user=${REPMGR_USERNAME} $(repmgr_get_conninfo_password) host=${REPMGR_NODE_NETWORK_NAME} dbname=${REPMGR_DATABASE} port=${REPMGR_PORT_NUMBER} connect_timeout=${REPMGR_CONNECT_TIMEOUT}" "-D" "$POSTGRESQL_DATA_DIR" "standby" "clone" "--fast-checkpoint" "--force")

    if [[ "$REPMGR_USE_PASSFILE" = "true" ]]; then
        PGPASSFILE="$REPMGR_PASSFILE_PATH" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    else
        PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    fi
}
```
Could you give it a try?
> As I've understood, implementing the conninfo string in the `repmgr_clone_primary()` function could solve the issue [...]
Sure, will give this a try.
Can you please help me understand why the user value is `"${REPMGR_USERNAME} $(repmgr_get_conninfo_password)"`? What does `repmgr_get_conninfo_password` signify?
@sonawaneshivraj my approach is simply to send the whole conninfo string in the `-d` flag of the `repmgr_clone_primary()` function. In the repmgr library scripts this equals (`librepmgr.sh:499`):

```bash
conninfo='user=${REPMGR_USERNAME} $(repmgr_get_conninfo_password) host=${REPMGR_NODE_NETWORK_NAME} dbname=${REPMGR_DATABASE} port=${REPMGR_PORT_NUMBER} connect_timeout=${REPMGR_CONNECT_TIMEOUT}'
```

So my intent is just to replicate that string as a base point for developing the fix. If you feel a parameter is not needed, you can simply try omitting it.
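A minimal sketch of assembling that same string from the chart's env vars (the helper name `build_clone_conninfo` is hypothetical, and I have used `REPMGR_CURRENT_PRIMARY_HOST` for the host since the clone targets the primary; adjust to taste):

```shell
# Hypothetical helper: build the -d conninfo from the chart's env vars,
# adding connect_timeout only when REPMGR_CONNECT_TIMEOUT is set.
build_clone_conninfo() {
  local conninfo="user=${REPMGR_USERNAME} host=${REPMGR_CURRENT_PRIMARY_HOST}"
  conninfo+=" dbname=${REPMGR_DATABASE} port=${REPMGR_PORT_NUMBER}"
  [[ -n "${REPMGR_CONNECT_TIMEOUT:-}" ]] && conninfo+=" connect_timeout=${REPMGR_CONNECT_TIMEOUT}"
  printf '%s\n' "$conninfo"
}

REPMGR_USERNAME=repmgr REPMGR_CURRENT_PRIMARY_HOST=pg-0 \
REPMGR_DATABASE=repmgr REPMGR_PORT_NUMBER=5432 REPMGR_CONNECT_TIMEOUT=30 \
  build_clone_conninfo
# → user=repmgr host=pg-0 dbname=repmgr port=5432 connect_timeout=30
```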
This could also work (just passing `conninfo` to the `-d` flag); I just wanted to be sure by using the env vars directly:
```bash
repmgr_clone_primary() {
    # Clears WAL directory if existing (pg_basebackup requires the WAL dir to be empty)
    local -r waldir=$(postgresql_get_waldir)
    if [[ -d "$waldir" ]]; then
        info "Deleting existing WAL directory $waldir..."
        rm -rf "$waldir" && ensure_dir_exists "$waldir"
    fi
    info "Cloning data from primary node..."
    # Note: assumes "$conninfo" has already been built elsewhere (e.g. the
    # string from librepmgr.sh); it is not defined inside this function.
    local -r flags=("-f" "$REPMGR_CONF_FILE" "-h" "$REPMGR_CURRENT_PRIMARY_HOST" "-p" "$REPMGR_CURRENT_PRIMARY_PORT" "-U" "$REPMGR_USERNAME" "-d" "$conninfo" "-D" "$POSTGRESQL_DATA_DIR" "standby" "clone" "--fast-checkpoint" "--force")

    if [[ "$REPMGR_USE_PASSFILE" = "true" ]]; then
        PGPASSFILE="$REPMGR_PASSFILE_PATH" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    else
        PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    fi
}
```
Or even narrowing the logic for this case: just adding the `connect_timeout` parameter to the `-d` flag.
```bash
repmgr_clone_primary() {
    # Clears WAL directory if existing (pg_basebackup requires the WAL dir to be empty)
    local -r waldir=$(postgresql_get_waldir)
    if [[ -d "$waldir" ]]; then
        info "Deleting existing WAL directory $waldir..."
        rm -rf "$waldir" && ensure_dir_exists "$waldir"
    fi
    info "Cloning data from primary node..."
    # dbname= is required here: once the argument contains '=', libpq parses
    # it as a conninfo string, so a bare database name would be rejected.
    local -r flags=("-f" "$REPMGR_CONF_FILE" "-h" "$REPMGR_CURRENT_PRIMARY_HOST" "-p" "$REPMGR_CURRENT_PRIMARY_PORT" "-U" "$REPMGR_USERNAME" "-d" "dbname=${REPMGR_DATABASE} connect_timeout=${REPMGR_CONNECT_TIMEOUT}" "-D" "$POSTGRESQL_DATA_DIR" "standby" "clone" "--fast-checkpoint" "--force")

    if [[ "$REPMGR_USE_PASSFILE" = "true" ]]; then
        PGPASSFILE="$REPMGR_PASSFILE_PATH" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    else
        PGPASSWORD="$REPMGR_PASSWORD" debug_execute "${REPMGR_BIN_DIR}/repmgr" "${flags[@]}"
    fi
}
```
The latter is what I am thinking of. I am dealing with an older version, 6.5.5 (appVersion 11.11.0); I need to figure out how to add this change to an existing image or build a new one with the change. Any pointers would be helpful.
Could you provide the exact chart values you are using? I can only see the resulting StatefulSet specification.
```yaml
annotations:
  category: Database
apiVersion: v2
appVersion: 11.11.0
dependencies:
  - name: common
    repository: file://../kubernetes-infra/bitnami-common
    version: 1.x.x
  - name: commonlib
    version: 0.1.0
    repository: file://../commonlib
description: Chart for PostgreSQL with HA architecture (using Replication Manager (repmgr) and Pgpool).
engine: gotpl
home: https://github.com/bitnami/charts/tree/master/bitnami/postgresql-ha
icon: https://bitnami.com/assets/stacks/postgresql/img/postgresql-stack-220x234.png
keywords:
  - postgresql
  - repmgr
  - pgpool
  - postgres
  - database
  - sql
  - replication
  - cluster
  - high availability
maintainers:
  - email: containers@bitnami.com
    name: Bitnami
name: postgresql-ha
sources:
  - https://github.com/bitnami/bitnami-docker-postgresql
  - https://www.postgresql.org/
version: 6.5.5
```
Hi @sonawaneshivraj,
As we talked privately, please try the fix Alberto provided with the latest image (you're using a very old one, more than two years old) and let us know if that fix works for you. Then we will create a new container image with the patch.
Thanks
Hello, I am seeing the same timeout error. Any pointers?
```console
➜ postgresql git:(main) ✗ kubectl run postgresdb-postgresql-client --rm --tty -i --restart='Never' --namespace test-dev --image docker.io/bitnami/postgresql:15.2.0-debian-11-r5 --env="PGPASSWORD=$POSTGRES_ADMIN_PASSWORD" \
  --command -- psql --host postgresdb-postgresql -U testme -d testdb -p 5432
If you don't see a command prompt, try pressing enter.
psql: error: connection to server at "postgresdb-postgresql" (10.233.19.75), port 5432 failed: Connection timed out
        Is the server running on that host and accepting TCP/IP connections?
pod "postgresdb-postgresql-client" deleted
pod test-dev/postgresdb-postgresql-client terminated (Error)
```

```console
➜ k -n test-dev exec -ti postgresdb-postgresql-client -- bash
I have no name!@postgresdb-postgresql-client:/$ psql -d testdb -U testme
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or directory
        Is the server running locally and accepting connections on that socket?
I have no name!@postgresdb-postgresql-client:/$ command terminated with exit code 137
```
```console
➜ tmp k logs -f postgresdb-postgresql-0 -n test-dev
postgresql 16:58:28.26
postgresql 16:58:28.26 Welcome to the Bitnami postgresql container
postgresql 16:58:28.26 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql 16:58:28.26 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql 16:58:28.27
postgresql 16:58:28.27 DEBUG ==> Configuring libnss_wrapper...
postgresql 16:58:28.29 INFO ==> ** Starting PostgreSQL setup **
postgresql 16:58:28.32 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 16:58:28.33 INFO ==> Loading custom pre-init scripts...
postgresql 16:58:28.34 INFO ==> Initializing PostgreSQL database...
postgresql 16:58:28.34 DEBUG ==> Ensuring expected directories/files exist...
postgresql 16:58:28.37 INFO ==> pg_hba.conf file not detected. Generating it...
postgresql 16:58:28.37 INFO ==> Generating local authentication configuration
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /bitnami/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:
    /opt/bitnami/postgresql/bin/pg_ctl -D /bitnami/postgresql/data -l logfile start
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
postgresql 16:58:29.46 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-03-13 16:58:29.528 GMT [68] LOG: starting PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-03-13 16:58:29.529 GMT [68] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-03-13 16:58:29.539 GMT [68] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-13 16:58:29.575 GMT [71] LOG: database system was shut down at 2023-03-13 16:58:29 GMT
2023-03-13 16:58:29.620 GMT [68] LOG: database system is ready to accept connections
done
server started
CREATE DATABASE
postgresql 16:58:29.79 INFO ==> Changing password of postgres
ALTER ROLE
postgresql 16:58:29.82 INFO ==> Creating user testme
CREATE ROLE
postgresql 16:58:29.84 INFO ==> Granting access to "testme" to the database "testdb"
GRANT
ALTER DATABASE
postgresql 16:58:29.89 INFO ==> Setting ownership for the 'public' schema database "testdb" to "testdb"
ALTER SCHEMA
postgresql 16:58:29.92 INFO ==> Configuring replication parameters
postgresql 16:58:29.96 INFO ==> Configuring synchronous_replication
postgresql 16:58:29.97 INFO ==> Configuring fsync
postgresql 16:58:30.03 INFO ==> Stopping PostgreSQL...
waiting for server to shut down....2023-03-13 16:58:30.041 GMT [68] LOG: received fast shutdown request
2023-03-13 16:58:30.053 GMT [68] LOG: aborting any active transactions
2023-03-13 16:58:30.055 GMT [68] LOG: background worker "logical replication launcher" (PID 74) exited with exit code 1
2023-03-13 16:58:30.056 GMT [69] LOG: shutting down
2023-03-13 16:58:30.064 GMT [69] LOG: checkpoint starting: shutdown immediate
2023-03-13 16:58:30.201 GMT [69] LOG: checkpoint complete: wrote 927 buffers (5.7%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.089 s, sync=0.021 s, total=0.146 s; sync files=257, longest=0.013 s, average=0.001 s; distance=4222 kB, estimate=4222 kB
2023-03-13 16:58:30.208 GMT [68] LOG: database system is shut down
done
server stopped
postgresql 16:58:30.24 INFO ==> Loading custom scripts...
postgresql 16:58:30.25 INFO ==> Loading user's custom files from /docker-entrypoint-initdb.d ...
postgresql 16:58:30.25 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2023-03-13 16:58:30.290 GMT [152] LOG: pgaudit extension initialized
2023-03-13 16:58:30.306 GMT [152] LOG: starting PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-03-13 16:58:30.306 GMT [152] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-03-13 16:58:30.320 GMT [152] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-13 16:58:30.340 GMT [155] LOG: database system was shut down at 2023-03-13 16:58:30 GMT
2023-03-13 16:58:30.369 GMT [152] LOG: database system is ready to accept connections
done
server started
2023-03-13 16:58:30.386 GMT [160] LOG: connection received: host=[local]
postgresql 16:58:30.39 DEBUG ==> Executing /docker-entrypoint-initdb.d/..2023_03_13_16_58_19.4294736326/init_script.sql
2023-03-13 16:58:30.407 GMT [166] LOG: connection received: host=[local]
2023-03-13 16:58:30.415 GMT [166] LOG: connection authenticated: identity="postgres" method=md5 (/opt/bitnami/postgresql/conf/pg_hba.conf:3)
2023-03-13 16:58:30.415 GMT [166] LOG: connection authorized: user=postgres database=testdb application_name=psql
CREATE EXTENSION
CREATE EXTENSION
GRANT
CREATE DATABASE
CREATE ROLE
GRANT
CREATE SCHEMA
CREATE SCHEMA
CREATE SCHEMA
CREATE SCHEMA
GRANT
GRANT
GRANT
GRANT
2023-03-13 16:58:31.151 GMT [166] LOG: disconnection: session time: 0:00:00.744 user=postgres database=testdb host=[local]
postgresql 16:58:31.15 INFO ==> Enabling remote connections
postgresql 16:58:31.16 INFO ==> Stopping PostgreSQL...
waiting for server to shut down....2023-03-13 16:58:31.173 GMT [152] LOG: received fast shutdown request
2023-03-13 16:58:31.183 GMT [152] LOG: aborting any active transactions
2023-03-13 16:58:31.185 GMT [152] LOG: background worker "logical replication launcher" (PID 158) exited with exit code 1
2023-03-13 16:58:31.185 GMT [153] LOG: shutting down
2023-03-13 16:58:31.192 GMT [153] LOG: checkpoint starting: shutdown immediate
2023-03-13 16:58:31.395 GMT [153] LOG: checkpoint complete: wrote 2144 buffers (13.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.121 s, sync=0.057 s, total=0.211 s; sync files=363, longest=0.037 s, average=0.001 s; distance=14579 kB, estimate=14579 kB
2023-03-13 16:58:31.404 GMT [152] LOG: database system is shut down
done
server stopped
postgresql 16:58:31.47 INFO ==> ** PostgreSQL setup finished! **
postgresql 16:58:31.49 INFO ==> ** Starting PostgreSQL **
2023-03-13 16:58:31.520 GMT [1] LOG: pgaudit extension initialized
2023-03-13 16:58:31.538 GMT [1] LOG: starting PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-03-13 16:58:31.539 GMT [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-03-13 16:58:31.539 GMT [1] LOG: listening on IPv6 address "::", port 5432
2023-03-13 16:58:31.549 GMT [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-03-13 16:58:31.571 GMT [183] LOG: database system was shut down at 2023-03-13 16:58:31 GMT
2023-03-13 16:58:31.583 GMT [1] LOG: database system is ready to accept connections
2023-03-13 16:58:37.918 GMT [193] LOG: connection received: host=localhost port=54264
2023-03-13 16:58:47.934 GMT [199] LOG: connection received: host=localhost port=54108
2023-03
```
Hi,
I will create an internal task for addressing the issue. We will get back to you here as soon as there is an outcome.
I am experiencing the exact same issue as described in the original post.

```console
postgresql-repmgr 21:16:49.07 DEBUG ==> Schema repmgr.repmgr exists!
postgresql-repmgr 21:16:49.07 INFO ==> Rejoining node...
postgresql-repmgr 21:16:49.08 INFO ==> Cloning data from primary node...
WARNING: following problems with command line parameters detected:
  -D/--pgdata will be ignored if a repmgr configuration file is provided
NOTICE: destination directory "/bitnami/postgresql/data" provided
ERROR: connection to database failed
DETAIL:
connection to server at "keycloak-postgresql-ha-postgresql-0.keycloak-postgresql-ha-postgresql-headless.keycloak.svc.cluster.local" (15.0.4.112), port 5432 failed: timeout expired
```
Out of the 3 nodes, only the one that happens to be scheduled on a different node pool runs into the timeout; by manually scheduling debugging pods onto that node and node pool, I have confirmed no packet loss or networking issues exist.
Hello @andricDu and everyone else who has faced this issue, did you manage to find a workaround/solution? I've been firefighting this for many days. I'd appreciate it if anyone could share the solution; I've also posted a question on SO.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
@ssjsk I also faced the same thing. If I run the image entrypoint and command in diagnostic mode, it works, as you said. So I compared diagnostic mode and non-diagnostic mode, and the biggest difference is whether Kubernetes uses the readiness probe or not. It works if you set readiness to false. Looking at the chart, there is a `headlessWithNotReadyAddresses` value; setting it causes the `readiness.sh` script to be used, replacing the existing readiness probe (the psql query `SELECT 1`). We are currently checking the exact reason.
hi @ryuseongryong you've found the pain point: disabling the liveness/readiness probes for this postgres-ha cluster makes it work smoothly, though disabling probes is definitely not ideal. I tried editing the Kubernetes StatefulSet for this Helm chart, mainly to change the readiness/liveness script, but didn't succeed. I hope the Bitnami team or someone else can find a solution.
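For anyone who wants to try the probe-disabling workaround from the chart values rather than by editing the StatefulSet, a hypothetical `values.yaml` fragment (key names follow the postgresql-ha chart's usual conventions, but verify them against your chart version before use):

```yaml
# Sketch of the workaround discussed above: disable the PostgreSQL probes.
# Verify these keys exist in your postgresql-ha chart version.
postgresql:
  livenessProbe:
    enabled: false
  readinessProbe:
    enabled: false
```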
Hi Team, we are also facing the same issue. Interestingly, whenever we enable readiness probes, the connection between two Postgres pods takes ~4 seconds, causing the timeout at 2 seconds in the `establish_db_connection_by_params` function.
From the logs in this ticket, I can also see that it is taking ~4 seconds here:

```
14:46:14.76 ==> Wait for schema repmgr.repmgr on 'metrics-postgresql-ha-postgresql-0.metrics-postgresql-ha-postgresql-headless.nsxi-platform.svc.cluster.local:5432', will try 6 times with 10 delay seconds (TIMEOUT=60)
14:46:18.83 ==> Schema repmgr.repmgr exists!
```
Do we know the root cause of this (why would the readiness probe make the communication slower)? Or is there any other solution apart from disabling the readiness probe?
Thank you for reaching out and for your interest in our project. We'd be happy to assist you, but to do so effectively we'll need some more details. Could you please create a separate issue, filling in the issue template?
Providing this information will help us better understand your situation and assist you in a more targeted manner.
Necroing this thread to point out that the workaround @aniket20 proposed does indeed work. While I must now live in dishonour for disabling the liveness and readiness probes, at least it runs 😆
Regarding `host=${REPMGR_NODE_NETWORK_NAME}`: why does the flag use `host=${REPMGR_NODE_NETWORK_NAME}` and not `host=$REPMGR_CURRENT_PRIMARY_HOST`? At the `repmgr standby clone` stage, the primary may be a different node.
> As I've understood, implementing the conninfo string in the `repmgr_clone_primary()` function could solve the issue [...]
Why not consider incorporating this change?
**Name and Version**
postgresql-ha 6.5.5
**What architecture are you using?**
None
**What steps will reproduce the bug?**
We have a 3-replica deployment (primary + 2 secondaries); the StatefulSet deployment details are below. In some of the setups, the secondaries go into a CrashLoopBackOff state with the following error.
On some of our setups this is hit consistently. There are no specific steps to reproduce; a normal deployment fails. It is easy to see on environments with slightly higher latencies.
Complete debug log
**Are you using any custom parameters or values?**
Deployment details as below
**What is the expected behavior?**
We understand that the timeout is possibly due to some latency in the failing environments. What we need is a way to specify the timeout value; currently there seems to be no way to achieve this, and it defaults to 2 seconds.
**What do you see instead?**
When the standby node comes up, before proceeding with cloning the data from the primary, it does some checks on the primary (more details here: https://github.com/EnterpriseDB/repmgr/blob/a90d1cf3dd8da7eb036a86c42dca99bf46741784/repmgr-action-standby.c#L5751).
During this check, it tries to connect to the repmgr DB on the primary, and this connection attempt times out with the said error.
In the absence of an externally specified timeout, this defaults to 2 seconds; more details here: https://github.com/EnterpriseDB/repmgr/blob/a90d1cf3dd8da7eb036a86c42dca99bf46741784/dbutils.c#L302.
**Additional information**
Based on the debugging done on our side, the following pieces of code in the Bitnami scripts control this workflow
and
The `repmgr_clone_primary` function is the one triggering the workflow. And, as described above, in the absence of a `connect_timeout` it defaults to 2 seconds: https://github.com/EnterpriseDB/repmgr/blob/a90d1cf3dd8da7eb036a86c42dca99bf46741784/dbutils.c#L302