MScallion opened this issue 3 years ago
In the run above, the verification is failing because the local server instance (used for validation) did not start within the timeout period set by DATABASE_SERVER_TIMEOUT; in this case 2 minutes. You can increase the time by setting DATABASE_SERVER_TIMEOUT to a higher value, for example:
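(A minimal sketch of an environment variable entry on the backup deployment configuration; the 300-second value is only an illustration.)

- name: DATABASE_SERVER_TIMEOUT
  # Seconds to wait for the verification server to start; adjust to suit the database size.
  value: "300"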
There are other factors that can cause the server to start slowly, such as CPU and memory resource allocation. The backup-container performs best when allowed to use "best-effort" resource allocation, which is the default specified in the example template: https://github.com/BCDevOps/backup-container/blob/master/openshift/templates/backup/backup-deploy.yaml#L428-L447
If your database does not use a "public" schema, you can specify the name of your database's schema via the TABLE_SCHEMA variable. This tells the backup-container which schema to back up and verify.
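For instance, if the application schema were named mals_app, the setting would look along these lines (shown in param-file form; the schema name is only an illustration):

TABLE_SCHEMA=mals_app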
Thanks Wade. Those four parameter values are set to the defaults and the table_schema parameter is set to the appropriate schema. I added the timeout parameter with 300 seconds and the verification now times out in 5 minutes.
I have updated the Dockerfile to pull Postgres v13 as our app database is v13. Hopefully that isn't the cause of the issue.
FROM centos/postgresql-13-centos7
That could be the issue. We haven't tested the backup-container with that version. It could be that some startup parameters have changed. What's the version of the database you're backing up?
Full version:
PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
I will redo the backup build using v12 to see if it is compatible.
It's typically preferable to have the backup-container use the same or a newer version than the database you are backing up. Would you be up for troubleshooting the startup issues and contributing the upgrade back to the project? I can provide you with some guidance on how to go about troubleshooting.
Certainly. Keep in mind that this is my first and only experience with OpenShift and I am still learning how everything works/interacts.
Open a terminal on a running instance of your backup-container. Then run the following command after replacing <DatabaseName>, <DatabaseUser>, and <DatabasePassword> with real values:
POSTGRESQL_DATABASE=<DatabaseName> \
POSTGRESQL_USER=<DatabaseUser> \
POSTGRESQL_PASSWORD=<DatabasePassword> \
run-postgresql
For the purpose of troubleshooting it does not really matter what the values are for <DatabaseName>, <DatabaseUser>, and <DatabasePassword>. Running the command will start the server and you'll be able to see the log output and hopefully find out where it's failing.
You're basically mimicking what's done by this function, https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L104-L123, without suppressing the log output.
I'm interested in this as well, using Postgis extension of Postgres 13, and just starting to look at using backup container.
Below is the output:
sh-4.2$ POSTGRESQL_DATABASE=mals \
> POSTGRESQL_USER=postgres \
> POSTGRESQL_PASSWORD=##### \
> run-postgresql
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... PST8PDT
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start
waiting for server to start....2021-07-28 13:25:13.746 PDT [38077] LOG: redirecting log output to logging collector process
2021-07-28 13:25:13.746 PDT [38077] HINT: Future log output will appear in directory "log".
done
server started
/var/run/postgresql:5432 - accepting connections
createuser: error: creation of new role failed: ERROR: role "postgres" already exists
I'm interested in this as well, using Postgis extension of Postgres 13, and just starting to look at using backup container.
Once the upgrade is complete, the PostGIS extension would need to be installed in the backup-container for your use case. It should be a simple update to the Dockerfile.
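As a rough sketch only (the exact package depends on the base image), adding PostGIS to a Debian-based postgres:13 image could look like the following; a CentOS-based image would need the equivalent PGDG yum package instead:

# Install the PostGIS extension packages (Debian/PGDG package name assumed).
RUN apt-get update && \
    apt-get install -y --no-install-recommends postgresql-13-postgis-3 && \
    rm -rf /var/lib/apt/lists/*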
@MScallion, Other than the createuser: error: creation of new role failed: ERROR: role "postgres" already exists, it looks like it's running. So your issue may be due to the way the container is configured.
What's the output from ./backup.sh -c?
Settings:
- Run mode: cron
- Backup strategy: rolling
- Current backup type: daily
- Backups to retain:
- Daily: 1
- Weekly: 1
- Monthly: 1
- Current backup folder: /backups/daily/2021-07-28/
- Time Zone: PDT -0700
- Schedule:
- 0 1 * * * default ./backup.sh -s
- 0 4 * * * default ./backup.sh -s -v all
- Container Type: postgres
- Databases (filtered by container type):
- postgres=patroni-master-dev/mals
- FTP server: not configured
- Webhook Endpoint: https://parcsystems.webhook.office.com/webhookb2/14b3e8f2-8035-4b3b-bf09-82f0b981616e@0d413f05-310c-4aad-8859-8943f3ded1f0/IncomingWebhook/8ac18ec5da754f5dbd7701dbb173c63e/9d0970f2-adfc-4c59-b638-053c71980087
- Environment Friendly Name: MALS
- Environment Name (Id): dev
Should the Environment Name be the namespace?
Should the Environment Name be the namespace?
Yes, but that shouldn't affect the verification.
I accepted all of the default values for the build. Below is our param file for the deployment.
Side note, the webhook did not produce notifications in Teams, as desired.
APP_NAME =mals-bkup
BACKUP_VOLUME_NAME =backup-mals
DAILY_BACKUPS ="1"
DATABASE_DEPLOYMENT_NAME =patroni-dev
DATABASE_NAME ="mals"
DATABASE_PASSWORD_KEY_NAME =app-db-owner-password
DATABASE_USER_KEY_NAME =app-db-owner-username
ENVIRONMENT_NAME ="dev"
ENVIRONMENT_FRIENDLY_NAME =""
IMAGE_NAMESPACE =30245e-tools
MONTHLY_BACKUPS ="1"
NAMESPACE_NAME ="30245e"
SOURCE_IMAGE_NAME =mals-bkup
TABLE_SCHEMA =mals_app
TAG_NAME =v1
WEBHOOK_URL ="https://parcsystems.webhook.office.com/webhookb2/14b3e8f2-8035-4b3b-bf09-82f0b981616e@0d413f05-310c-4aad-8859-8943f3ded1f0/IncomingWebhook/8ac18ec5da754f5dbd7701dbb173c63e/9d0970f2-adfc-4c59-b638-053c71980087"
WEEKLY_BACKUPS ="1"
#
#
#
# Parameter Defaults
#
#
# BACKUP_DIR =/backups/
# BACKUP_PERIOD =""
# BACKUP_STRATEGY =rolling
# BACKUP_VOLUME_CLASS =netapp-file-backup
# BACKUP_VOLUME_SIZE =5Gi
# CONFIG_FILE_NAME =backup.conf
# CONFIG_MAP_NAME =backup-conf
# CONFIG_MOUNT_PATH =/
# CPU_LIMIT ="0"
# CPU_REQUEST ="0"
# DATABASE_SERVICE_NAME =""
# FTP_PASSWORD =""
# FTP_SECRET_KEY =ftp-secret
# FTP_URL =""
# FTP_URL_HOST =
# FTP_USER =""
# MEMORY_LIMIT =0Mi
# MEMORY_REQUEST =0Mi
# MONGODB_AUTHENTICATION_DATABASE =""
# MSSQL_SA_PASSWORD =
# NAME =backup-postgres
# NUM_BACKUPS =""
# ROLE =backup-container
# VERIFICATION_VOLUME_CLASS =netapp-file-standard
# VERIFICATION_VOLUME_MOUNT_PATH =/var/lib/pgsql/data
# VERIFICATION_VOLUME_NAME =backup-verification
# VERIFICATION_VOLUME_SIZE =1Gi
# WEBHOOK_URL_HOST =
When I updated the app database to v13 I did not contribute the changes because I split the database user into a DB owner and an application proxy user. I understood that this could not be used because it was inconsistent with previous templates. Hopefully this is not an issue for the verification as the backup appears to have accepted the user/password that I provided.
@MScallion, You are missing environment variables for PATRONI_MASTER_DEV_USER and PATRONI_MASTER_DEV_PASSWORD, as defined here: https://github.com/BCDevOps/backup-container#backupconf
That would cause these lookups to return blank strings and cause the startup to fail: https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L119-L120
When I updated the app database to v13 I did not contribute the changes because I split the database user into a DB owner and an application proxy user. I understood that this could not be used because it was inconsistent with previous templates. Hopefully this is not an issue for the verification as the backup appears to have accepted the user/password that I provided.
Are you saying you've modified the code, or just configured the container in a very specific way?
Thanks Wade, I don't see those in the backup-container repository, which is the repo that I cloned. I added the four parameters to the deployment template. I am not sure how to update the config file. Currently I have defined the database as:
postgres=patroni-master-dev/mals
Should I add the two lines as-is? I.e. are these values to be set the same for all projects, or should I update them to reflect project-specific components?
db-bc:5432/aries-vcr
wallet-bc:5432/agent_bc_wallet
Those values are a customization that needs to be done for each application of the container when you use the backup.conf file to define the backups and schedule. I was just providing examples of the customization done for another project.
The examples were to show:
- The DB_BC_USER and DB_BC_PASSWORD environment variables are associated with the db-bc:5432/aries-vcr entry in the backup.conf file.
- The WALLET_BC_USER and WALLET_BC_PASSWORD environment variables are associated with the wallet-bc:5432/agent_bc_wallet entry in the backup.conf file.
They map the credentials for the databases, via naming convention, for the scripts to pick up.
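As a hedged sketch, a backup.conf following that convention might look like the one below; the container would then need DB_BC_USER/DB_BC_PASSWORD and WALLET_BC_USER/WALLET_BC_PASSWORD defined as environment variables:

# Databases to back up; credentials are supplied via <HOST>_USER / <HOST>_PASSWORD variables.
postgres=db-bc:5432/aries-vcr
postgres=wallet-bc:5432/agent_bc_wallet
# Schedule
0 1 * * * default ./backup.sh -s
0 4 * * * default ./backup.sh -s -v all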
Are you saying you've modified the code, or just configured the container in a very specific way?
I replaced the APP_USER/APP_PASSWORD secret values with app-db-owner-username/app-db-owner-password and app-proxy-username/app-proxy-password in order to configure a superuser, a DB owner, and an application user, in an attempt to comply with the principle of least privilege. I then administered the necessary role and grants for the app user.
I also updated the Dockerfile to reference:
FROM postgres:13
ENV PATRONI_VERSION=2.0.1
and to include permission changes (chmod) that helped resolve an error early on, possibly as a result of the changes to the users:
RUN chmod a+x /usr/share/scripts/patroni/health_check.sh
RUN chmod a+x /usr/share/scripts/patroni/post_init.sh
So for our application database we have used the following parameters in the deployment template:
- name: DATABASE_USER
valueFrom:
secretKeyRef:
name: ${DATABASE_DEPLOYMENT_NAME}
key: ${DATABASE_USER_KEY_NAME}
- name: DATABASE_PASSWORD
valueFrom:
secretKeyRef:
name: ${DATABASE_DEPLOYMENT_NAME}
key: ${DATABASE_PASSWORD_KEY_NAME}
For which I have provided the following values:
DATABASE_DEPLOYMENT_NAME =patroni-dev (Name of the Secret)
DATABASE_NAME ="mals"
DATABASE_PASSWORD_KEY_NAME =app-db-owner-password
DATABASE_USER_KEY_NAME =app-db-owner-username
Those are the default ones which are used for the legacy mode of operation only. When using the newer cron mode of operation with the backup.conf file, you need to update your deployment configuration with the additional "credential" variables as laid out in the documentation here: https://github.com/BCDevOps/backup-container#backupconf. The examples I provided follow those conventions.
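Following the pattern of the DATABASE_USER/DATABASE_PASSWORD entries shown above, a sketch of the additional credential variables for the postgres=patroni-master-dev/mals entry might look like this (the secret name and key names below are taken from this thread and are placeholders that need to match your own secret):

- name: PATRONI_MASTER_DEV_USER
  valueFrom:
    secretKeyRef:
      name: patroni-dev
      key: app-db-owner-username
- name: PATRONI_MASTER_DEV_PASSWORD
  valueFrom:
    secretKeyRef:
      name: patroni-dev
      key: app-db-owner-password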
Understood, I will work to add those then post my results.
I had assumed that because the backups execute successfully, the credentials might not be the root cause of the issue. The backups successfully complete in ~2 seconds, condensing the 66 MB DB storage into a 13 MB backup file. I just updated the secret username and password pair to use the postgres user and the issue persisted.
The readme states that the parameter names are prefixed with the service name. Does it have to be exact? If so then I will likely need to alter our service names as they currently include the environment, ie patroni-master-dev.
I am hoping that I can use PATRONI_MASTER_USER, PATRONI_MASTER_PASSWORD for all environments, without having to modify the existing service names.
The verification again timed out in 5 minutes after I added those parameters to the deployment template and redeployed:
PATRONI_MASTER_DEV_USER
PATRONI_MASTER_DEV_PASSWORD
@WadeBarnes, I have cloned/updated the orgbook-containers repo and I am able to run both the backup and validation. I am now receiving a validation error stating that the application role, to which I grant all object privileges, does not exist. The grants are required because I have split the user into a DB owner and application proxy user. I am investigating alternate dumpfile options, ie pg_dumpall and/or the -x option, to find a best fit for my current configuration and restore capabilities.
The readme states that the parameter names are prefixed with the service name. Does it have to be exact? If so then I will likely need to alter our service names as they currently include the environment, ie patroni-master-dev.
I am hoping that I can use PATRONI_MASTER_USER, PATRONI_MASTER_PASSWORD for all environments, without having to modify the existing service names.
No, it has to be exact. It's best practice to keep environment designations (i.e. dev, test, prod) out of your resource names wherever possible. Ideally you want your resource names to be exactly the same in each and every environment, except for the image tag used to deploy the images into a given namespace.
@WadeBarnes, I have cloned/updated the orgbook-containers repo and I am able to run both the backup and validation. I am now receiving a validation error stating that the application role, to which I grant all object privileges, does not exist. The grants are required because I have split the user into a DB owner and application proxy user. I am investigating alternate dumpfile options, ie pg_dumpall and/or the -x option, to find a best fit for my current configuration and restore capabilities.
Would you be able to share the error log?
Thanks for the tip Wade. One limitation that I see is that we have test and UAT in the same namespace. We will need to review our configuration and naming standards.
I have confirmed that the role does not exist in the generated SQL. The documentation states that pg_dumpall can include the roles and I will work with that to determine the best settings for our requirement.
[!!ERROR!!] - Backup verification failed: /backups/daily/2021-07-31/patroni-master-dev-mals_2021-07-31_01-00-00.sql.gz
The following issues were encountered during backup verification;
Restoring '/backups/daily/2021-07-31/patroni-master-dev-mals_2021-07-31_01-00-00.sql.gz' to '127.0.0.1/mals' ...
ERROR: role "mals_app_role" does not exist
Restore failed.
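One possible workaround, sketched here with names taken from this thread, is to capture the roles separately with pg_dumpall and apply them (or create the missing role by hand) before the restore:

# Dump only the global objects (roles, tablespaces) from the source database host.
pg_dumpall -h patroni-master-dev -U postgres --globals-only > globals.sql
# Or simply create the missing role on the verification/restore target first.
psql -h 127.0.0.1 -d mals -c 'CREATE ROLE mals_app_role;'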
FYI, I tried running backup verification with my PostGIS 13-3.1 image, within an OpenShift CronJob, and got the following error:
find: ‘/var/lib/pgsql/data’: No such file or directory
Cleaning up ...
waiting for server to start.................................................................................................
The server failed to start within 0h:2m:0s.
The server started in 0h:2m:0s.
pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist
find: ‘/var/lib/pgsql/data’: No such file or directory
Cleaning up ...
Checking the image, /var/lib/postgresql/data exists instead.
The backups seem to run properly (from a CronJob); I just seem to have issues running backup verification. It is possible, though, that I haven't configured my CronJob properly with the right env variables for the verification.
@basilv, Do you have a volume mounted to /var/lib/pgsql/data in order to hold the data being restored during the verification?
Oops, yup, it would help to mount the verification PVC.
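For reference, a minimal sketch of what mounting the verification PVC into the CronJob's backup container might look like (claim and container names are placeholders):

spec:
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              volumeMounts:
                - name: backup-verification
                  # Default VERIFICATION_VOLUME_MOUNT_PATH
                  mountPath: /var/lib/pgsql/data
          volumes:
            - name: backup-verification
              persistentVolumeClaim:
                claimName: backup-verification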
Here's the latest result after adding the verification PVC and setting the database startup timeout to 300 seconds:
waiting for server to start.......................................................................................................................................................................................................................................................
The server failed to start within 0h:5m:0s.
The server started in 0h:5m:0s.
pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist
Already clean ...
[!!ERROR!!] - Backup verification failed: /backups/daily/2021-08-04/fom-db-dev-86-fom_2021-08-04_14-46-14.sql.gz
Note that I'm not using separate roles/accounts like MScallion, so that shouldn't be the issue. Presumably I'm either missing some required setting, or there's a Postgres 13 compatibility issue. I do have the output from /backup.sh -c, provided below:
Listing configuration settings ...
Settings:
Run mode: legacy
Backup strategy: rolling
Current backup type: daily
Backups to retain:
Current backup folder: /backups/daily/2021-08-04/
Schedule:
Container Type: postgres
Databases (filtered by container type):
FTP server: not configured
Webhook Endpoint: not configured
Environment Friendly Name: not configured
Environment Name (Id): not configured
@basilv, Did the trouble you were having here get resolved over here, https://github.com/BCDevOps/backup-container/issues/85?
@MScallion, have you been able to resolve the issues you were having?
@basilv, Did the trouble you were having here get resolved over here, #85?
No, as per #85 I figured out how to run the backup verification from my cron job, but it still fails as described above. I've manually tested the database restore process and confirmed the backups are working properly, so it's just something in the verification script. Not sure if it is a Postgres 13 issue, or a PostGIS issue (e.g. maybe the verification script is skipping initialization needed for the PostGIS extension).
It's failing to start postgres during the verification, which is typically due to the database credentials not being wired up correctly. You could troubleshoot by running the onStartServer scripts manually without the >/dev/null 2>&1 bit so you can see the logs; https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L104-L123
Would be nice to have an option to show the logs without having to build a completely different container to remove the >/dev/null 2>&1 bit. Looking at the onStartServer function defined in backup.postgres.plugin, it seems to execute 'run-postgresql', but I don't see that defined anywhere (and would have expected pg_ctl to be used, like it is in onStopServer).
I tried running pg_ctl start/init, here's the results:
$ pg_ctl start
pg_ctl: directory "/var/lib/postgresql/data" is not a database cluster directory
(this is after overriding VERIFICATION_VOLUME_MOUNT_PATH to be this location instead of /var/lib/pgsql/data)
$ pg_ctl init
initdb: could not look up effective user ID 1007290000: user does not exist
(my postgres container runs as user postgres, but the backup container Dockerfile uses user 26, maybe that's the issue?)
Also, I've had a hard time telling from the documentation how to supply the credentials. I'm using Openshift CronJob approach. For the cronjob container, I'm supplying env variables DATABASE_USER and DATABASE_PASSWORD as well as POSTGRESQL_USER and POSTGRESQL_PASSWORD (all values pulled from secret). I'm pretty sure I just need DATABASE_USER & DATABASE_PASSWORD though.
(FYI, I've updated my database to use patroni (with postgres 13 and PostGIS). This hasn't seemed to impact the backup procedure at all - still works fine, so at least that's positive.)
@MScallion, have you been able to resolve the issues you were having?
No. I would need to update the backup.postgres.plugin file in the BCDevOps/backup-container repo to use pg_dumpall, which I have verified locally, in place of pg_dump -Fc. I am also not able to pull from my fork of that repo due to security settings. In the short term I was able to back up and successfully restore the database manually by 1) executing my deployment script to create the database, users, and roles, then 2) restoring the SQL from the backup. Fortunately I had separated the create-database step from the create-DB-objects step in my deployment scripts.
I am not able to successfully validate the backup files using the automation provided in the backup-container scripts, due to the grants to the roles, which I have added to the source templates and which are not included in the pg_dump function.
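For anyone following the same manual path, the restore step for the plain-SQL backup produced by the container can be sketched roughly as follows (host, database, and file name are taken from this thread and would need adjusting):

# Recreate the database, users, and roles with the deployment script first, then load the dump.
gunzip -c /backups/daily/2021-07-31/patroni-master-dev-mals_2021-07-31_01-00-00.sql.gz \
  | psql -h patroni-master-dev -U postgres -d mals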
Thanks @joeleonjr , I checked all my prior comments but see no API keys. (I didn't check old edits of comments, no idea how to do that.)
Thanks! @MScallion is on a separate team, so they'll need to respond.
This morning was the first time executing the backup, and I cannot get the verification to work. We do not have a public schema in our database; hopefully it is not required, as it was not for the backup. I have attempted the verification several times. Any help with steps that I can take to investigate the cause of this issue is appreciated.
Namespace = 30245e-dev.