BCDevOps / backup-container

A simple container for a simple backup strategy.
Apache License 2.0
42 stars 56 forks source link

Postgres backup verification error - The server failed to start #83

Open MScallion opened 3 years ago

MScallion commented 3 years ago

First time executing the backup was this morning and I cannot get the verification to work. We do not have a public schema in our database, hopefully it is not required as it was not for the backup. I have attempted the verification several times. Any help in steps that I can take to investigate the cause of this issue is appreciated.

Namespace = 30245e-dev.


sh-4.2$ ./backup.sh -s -v all

Verifying backup ...

Settings:
- Database: postgres=patroni-master-dev/mals
- Backup file: /backups/daily/2021-07-27/patroni-master-dev-mals_2021-07-27_11-12-00.sql.gz

Already clean ...

waiting for server to start.....................................................................................................................
The server failed to start within 0h:2m:0s.

The server started in 0h:2m:0s.

waiting for server to shut down.... done
server stopped
Cleaning up ...

[!!ERROR!!] - Backup verification failed: /backups/daily/2021-07-27/patroni-master-dev-mals_2021-07-27_11-12-00.sql.gz

Elapsed time: 0h:2m:1s - Status Code: 1
WadeBarnes commented 3 years ago

In the run above, the verification is failing because the local server instance (used for validation) did not start within the timeout period set for DATABASE_SERVER_TIMEOUT; in this case 2 minutes. You can increase the time by setting DATABASE_SERVER_TIMEOUT to a higher value, for example;

image

There are other factors that can cause the server to start slowly such as CPU and Memory resource allocation. The backup-container performs best when allowed to use "best-effort" resource allocation, which is the default specified in the example template; https://github.com/BCDevOps/backup-container/blob/master/openshift/templates/backup/backup-deploy.yaml#L428-L447

WadeBarnes commented 3 years ago

If your database does not use a "public" schema, you can specify the name of your database's schema via the TABLE_SCHEMA variable. This will tell the backup-container which schema to backup and verify.

https://github.com/BCDevOps/backup-container/blob/master/openshift/templates/backup/backup-deploy.yaml#L293-L297

MScallion commented 3 years ago

Thanks Wade. Those four parameter values are set to the defaults and the table_schema parameter is set to the appropriate schema. I added the timeout parameter with 300 seconds and the verification now times out in 5 minutes.

I have updated the Dockerfile to pull Postgres v13 as our app database is v13. Hopefully that isn't the cause of the issue.

FROM centos/postgresql-13-centos7
WadeBarnes commented 3 years ago

That could be the issue. We haven't tested the backup-container with that version. It could be that some startup parameters have changed. What's the version of the database you're backing up?

MScallion commented 3 years ago

Full version:

 PostgreSQL 13.3 (Debian 13.3-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit

I will redo the backup build using v12 for to see if it is compatible.

WadeBarnes commented 3 years ago

It's typically preferable to have the backup-container use the same or newer version as the database you are backing up. Would you be up for troubleshooting the startup issues and contributing the upgrade back to the project? I can provide you with some guidance on how to go about troubleshooting.

MScallion commented 3 years ago

Certainly. Keep in mind that this is my first and only experience with OpenShift and I am still learning how everything works/interacts.

WadeBarnes commented 3 years ago

Open a terminal on a running version of your backup-container. The you'll run the following command after replacing the <DatabaseName>, <DatabaseUser> and <DatabasePassword> with some real values:

POSTGRESQL_DATABASE=<DatabaseName> \
POSTGRESQL_USER=<DatabaseUser> \
POSTGRESQL_PASSWORD=<DatabasePassword> \
run-postgresql

For the purpose of troubleshooting it does not really matter what the values are for <DatabaseName>, <DatabaseUser> and <DatabasePassword>. Running the command will start start the server and you'll be able to see the log output and hopefully find out where it's failing.

You're basically mimicking what's done by this function, https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L104-L123, without suppressing the log output.

basilv commented 3 years ago

I'm interested in this as well, using Postgis extension of Postgres 13, and just starting to look at using backup container.

MScallion commented 3 years ago

Below is the output;

sh-4.2$ POSTGRESQL_DATABASE=mals \
> POSTGRESQL_USER=postgres \
> POSTGRESQL_PASSWORD=##### \
> run-postgresql
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/pgsql/data/userdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... PST8PDT
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    pg_ctl -D /var/lib/pgsql/data/userdata -l logfile start

waiting for server to start....2021-07-28 13:25:13.746 PDT [38077] LOG:  redirecting log output to logging collector process
2021-07-28 13:25:13.746 PDT [38077] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
createuser: error: creation of new role failed: ERROR:  role "postgres" already exists
WadeBarnes commented 3 years ago

I'm interested in this as well, using Postgis extension of Postgres 13, and just starting to look at using backup container.

Once the upgrade is complete, the Postgis extension would need to be installed on the backup-container for your use case. It should be a simple update to the dockerfile.

WadeBarnes commented 3 years ago

@MScallion, Other then the createuser: error: creation of new role failed: ERROR: role "postgres" already exists, it looks like it's running. So you're issue may be due to the way the container is configured.

What's the output from ./backup.sh -c?

MScallion commented 3 years ago
Settings:
- Run mode: cron

- Backup strategy: rolling
- Current backup type: daily
- Backups to retain:
  - Daily: 1
  - Weekly: 1
  - Monthly: 1
- Current backup folder: /backups/daily/2021-07-28/
- Time Zone: PDT -0700

- Schedule:
  - 0 1 * * * default ./backup.sh -s
  - 0 4 * * * default ./backup.sh -s -v all

- Container Type: postgres
- Databases (filtered by container type):
  - postgres=patroni-master-dev/mals

- FTP server: not configured
- Webhook Endpoint: https://parcsystems.webhook.office.com/webhookb2/14b3e8f2-8035-4b3b-bf09-82f0b981616e@0d413f05-310c-4aad-8859-8943f3ded1f0/IncomingWebhook/8ac18ec5da754f5dbd7701dbb173c63e/9d0970f2-adfc-4c59-b638-053c71980087
- Environment Friendly Name: MALS
- Environment Name (Id): dev
MScallion commented 3 years ago

Should the Environment Name be the namespace?

WadeBarnes commented 3 years ago

Should the Environment Name be the namespace?

Yes, but that shouldn't effect the verification.

MScallion commented 3 years ago

I accepted all of the default values for the build. Below is our param file for the deployment.

Side note, the webhook did not produce notifications in Teams, as desired.

APP_NAME                              =mals-bkup
BACKUP_VOLUME_NAME                    =backup-mals
DAILY_BACKUPS                         ="1"
DATABASE_DEPLOYMENT_NAME              =patroni-dev
DATABASE_NAME                         ="mals"
DATABASE_PASSWORD_KEY_NAME            =app-db-owner-password
DATABASE_USER_KEY_NAME                =app-db-owner-username
ENVIRONMENT_NAME                      ="dev"
ENVIRONMENT_FRIENDLY_NAME             =""
IMAGE_NAMESPACE                       =30245e-tools
MONTHLY_BACKUPS                       ="1"
NAMESPACE_NAME                        ="30245e"
SOURCE_IMAGE_NAME                     =mals-bkup
TABLE_SCHEMA                          =mals_app
TAG_NAME                              =v1
WEBHOOK_URL                           ="https://parcsystems.webhook.office.com/webhookb2/14b3e8f2-8035-4b3b-bf09-82f0b981616e@0d413f05-310c-4aad-8859-8943f3ded1f0/IncomingWebhook/8ac18ec5da754f5dbd7701dbb173c63e/9d0970f2-adfc-4c59-b638-053c71980087"
WEEKLY_BACKUPS                        ="1"
#
#
#
# Parameter Defaults
#
#
# BACKUP_DIR                            =/backups/
# BACKUP_PERIOD                         =""
# BACKUP_STRATEGY                       =rolling
# BACKUP_VOLUME_CLASS                   =netapp-file-backup
# BACKUP_VOLUME_SIZE                    =5Gi
# CONFIG_FILE_NAME                      =backup.conf
# CONFIG_MAP_NAME                       =backup-conf
# CONFIG_MOUNT_PATH                     =/
# CPU_LIMIT                             ="0"
# CPU_REQUEST                           ="0"
# DATABASE_SERVICE_NAME                 ="" 
# FTP_PASSWORD                          =""
# FTP_SECRET_KEY                        =ftp-secret
# FTP_URL                               =""
# FTP_URL_HOST                          =
# FTP_USER                              =""
# MEMORY_LIMIT                          =0Mi
# MEMORY_REQUEST                        =0Mi
# MONGODB_AUTHENTICATION_DATABASE       =""
# MSSQL_SA_PASSWORD                     =
# NAME                                  =backup-postgres
# NUM_BACKUPS                           =""#                                    
# ROLE                                  =backup-container#                                    
# VERIFICATION_VOLUME_CLASS             =netapp-file-standard
# VERIFICATION_VOLUME_MOUNT_PATH        =/var/lib/pgsql/data
# VERIFICATION_VOLUME_NAME              =backup-verification
# VERIFICATION_VOLUME_SIZE              =1Gi#                                    
# WEBHOOK_URL_HOST                      =
MScallion commented 3 years ago

When I updated the app database to v13 I did not contribute the changes because I split the database user into a DB owner and an application proxy user. I understood that this could not be used because it was inconsistent with previous templates. Hopefully this is not an issue for the verification as the backup appears to have accepted the user/password that I provided.

WadeBarnes commented 3 years ago

@MScallion, You are missing environment variables for PATRONI_MASTER_DEV_USER and PATRONI_MASTER_DEV_PASSWORD, as defined here; https://github.com/BCDevOps/backup-container#backupconf

WadeBarnes commented 3 years ago

Examples: DC - https://github.com/bcgov/orgbook-configurations/blob/master/openshift/templates/backup/backup-deploy.yaml#L128-L147 backup.conf - https://github.com/bcgov/orgbook-configurations/blob/master/openshift/templates/backup/config/backup.conf#L39-L40

WadeBarnes commented 3 years ago

That would be causing these lookups to return blank strings and cause the startup to fail; https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L119-L120

WadeBarnes commented 3 years ago

When I updated the app database to v13 I did not contribute the changes because I split the database user into a DB owner and an application proxy user. I understood that this could not be used because it was inconsistent with previous templates. Hopefully this is not an issue for the verification as the backup appears to have accepted the user/password that I provided.

Are you saying you've modified the code, or just configured the container in a very specific way?

MScallion commented 3 years ago

Thanks Wade, I don't see those in the backup-container repository, which is the repo that I cloned. I added the four parameters to the deployment template. I am not sure how to update the config file. Currently I have defined the database as;

postgres=patroni-master-dev/mals

Should I add the two lines as-is? ie are these values to be set the same for all projects or should I update them to reflect project specific components?

 db-bc:5432/aries-vcr
 wallet-bc:5432/agent_bc_wallet
WadeBarnes commented 3 years ago

Those values are a customization that need to be done for each application of the container when you use the backup.conf file to define the backups and schedule. I was just providing examples of the customization done for another project.

The examples were to show: DB_BC_USER, and DB_BC_PASSWORD environment variables are associated with the db-bc:5432/aries-vcr entry in the backup.conf file, and WALLET_BC_USER and WALLET_BC_PASSWORD environment variables are associated with the wallet-bc:5432/agent_bc_wallet entry in the backup.conf file. They map the credentials for the databases, via naming convention for the scripts to pick up.

MScallion commented 3 years ago

Are you saying you've modified the code, or just configured the container in a very specific way?

I replaced the APP_USER/APP_PASSWORD secret values with app-db-owner-username/app-db-owner-password and app-proxy-username/app-proxy-password in order to configure a super user, DB owner and an application user, in an attempt to be in compliance with the principle of least privileges. I then administered the necessary role and grants for the app user.

I also updated the Dockerfile to references:

FROM postgres:13
ENV PATRONI_VERSION=2.0.1

and to include grants that helped to resolve an error early on, possibly as a result of the changes to the users;

RUN chmod a+x /usr/share/scripts/patroni/health_check.sh
RUN chmod a+x /usr/share/scripts/patroni/post_init.sh
MScallion commented 3 years ago

So for our application database we have used the following parameters in the deployment template;

            - name: DATABASE_USER
              valueFrom:
                secretKeyRef:
                  name: ${DATABASE_DEPLOYMENT_NAME}
                  key: ${DATABASE_USER_KEY_NAME}
            - name: DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: ${DATABASE_DEPLOYMENT_NAME}
                  key: ${DATABASE_PASSWORD_KEY_NAME}

For which I have provided the following values;

DATABASE_DEPLOYMENT_NAME              =patroni-dev (Name of the Secret)
DATABASE_NAME                         ="mals"
DATABASE_PASSWORD_KEY_NAME            =app-db-owner-password
DATABASE_USER_KEY_NAME                =app-db-owner-username
WadeBarnes commented 3 years ago

So for our application database we have used the following parameters in the deployment template;

            - name: DATABASE_USER
              valueFrom:
                secretKeyRef:
                  name: ${DATABASE_DEPLOYMENT_NAME}
                  key: ${DATABASE_USER_KEY_NAME}
            - name: DATABASE_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: ${DATABASE_DEPLOYMENT_NAME}
                  key: ${DATABASE_PASSWORD_KEY_NAME}

For which I have provided the following values;

DATABASE_DEPLOYMENT_NAME              =patroni-dev (Name of the Secret)
DATABASE_NAME                         ="mals"
DATABASE_PASSWORD_KEY_NAME            =app-db-owner-password
DATABASE_USER_KEY_NAME                =app-db-owner-username

Those are the default ones which are used for the legacy mode of operation only. When using the newer cron mode of operation with the backup.conf file you need to update your deployment configuration with the additional "credential" variables as laid out in the documentation here, https://github.com/BCDevOps/backup-container#backupconf. The examples I provided follow those conventions.

MScallion commented 3 years ago

Understood, I will work to add those then post my results.

I had assumed that because the backups execute successfully that the credentials might not be the root cause of the issue. The backups successfully complete in ~2 seconds, condensing the 66mb DB storage into 13mb in the backup file. I just updated the secret username and password pair to use the postgres user and the issue persisted,

MScallion commented 3 years ago

The readme states that the parameter names are prefixed with the service name. Does it have to be exact? If so then I will likely need to alter our service names as they currently include the environment, ie patroni-master-dev.

I am hoping that I can use PATRONI_MASTER_USER, PATRONI_MASTER_PASSWORD for all environments, without having to modify the existing service names.

MScallion commented 3 years ago

The verification again timed out in 5m after I added those parameters to the deployment template and redeployed;

PATRONI_MASTER_DEV_USER
PATRONI_MASTER_DEV_PASSWORD
MScallion commented 3 years ago

@WadeBarnes, I have cloned/updated the orgbook-containers repo and I am able to run both the backup and validation. I am now receiving a validation error stating that the application role, to which I grant all object privileges, does not exist. The grants are required because I have split the user into a DB owner and application proxy user. I am investigating alternate dumpfile options, ie pg_dumpall and/or the -x option, to find a best fit for my current configuration and restore capabilities.

WadeBarnes commented 3 years ago

The readme states that the parameter names are prefixed with the service name. Does it have to be exact? If so then I will likely need to alter our service names as they currently include the environment, ie patroni-master-dev.

I am hoping that I can use PATRONI_MASTER_USER, PATRONI_MASTER_PASSWORD for all environments, without having to modify the existing service names.

No, it has to be exact. It's best practice to keep environment designations (i.e. dev, test, prod) out of your resource names where ever possible. Ideally you want your resource names to be exactly the same in each and every environment, except for image tag used to deploy the images into a given namespace.

WadeBarnes commented 3 years ago

@WadeBarnes, I have cloned/updated the orgbook-containers repo and I am able to run both the backup and validation. I am now receiving a validation error stating that the application role, to which I grant all object privileges, does not exist. The grants are required because I have split the user into a DB owner and application proxy user. I am investigating alternate dumpfile options, ie pg_dumpall and/or the -x option, to find a best fit for my current configuration and restore capabilities.

Would you be able to share the error log?

MScallion commented 3 years ago

Thanks for the tip Wade. One limitation that I see is that we have test and UAT in the same namespace. We will need to review our configuration and naming standards.

I have confirmed that the role does not exist in the generated SQL. The documentation states that pg_dumpall can include the roles and I will work with that to determine the best settings for our requirement.

[!!ERROR!!] - Backup verification failed: /backups/daily/2021-07-31/patroni-master-dev-mals_2021-07-31_01-00-00.sql.gz

The following issues were encountered during backup verification;
Restoring '/backups/daily/2021-07-31/patroni-master-dev-mals_2021-07-31_01-00-00.sql.gz' to '127.0.0.1/mals' ...

ERROR:  role "mals_app_role" does not exist

Restore failed.
basilv commented 3 years ago

FYI, I tried running backup verification with my postgis 13:3-.1 image, within a OpenShift CronJob, and got the following error: find: ‘/var/lib/pgsql/data’: No such file or directory Cleaning up ...

waiting for server to start................................................................................................. The server failed to start within 0h:2m:0s.   The server started in 0h:2m:0s. 

pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist find: ‘/var/lib/pgsql/data’: No such file or directory Cleaning up ...

Checking the image, /var/lib/postgresql/data exists instead.

The backups seems to run properly (from a CronJob), I just seem to have issues running backup verification. It is possible though that I haven't configured my CronJob properly with the right env variables for the verification.

WadeBarnes commented 3 years ago

@basilv, Do you have a volume mounted to /var/lib/pgsql/data in order to hold the data being restored during the verification?

basilv commented 3 years ago

Oops, yup would help to mount the verification PVC.

basilv commented 3 years ago

Here's the latest result after adding the verification PVC and setting the database startup timeout to 300 seconds waiting for server to start....................................................................................................................................................................................................................................................... The server failed to start within 0h:5m:0s.   The server started in 0h:5m:0s. 

pg_ctl: directory "/var/lib/pgsql/data/userdata" does not exist Already clean ...

[!!ERROR!!] - Backup verification failed: /backups/daily/2021-08-04/fom-db-dev-86-fom_2021-08-04_14-46-14.sql.gz

Note that I'm not using separate roles/accounts like MScallion, so that shouldn't be the issue. Presumably I'm either missing some required setting, or there's a Postgres 13 compatibility issue. I do have the output from /backup.sh -c, provided below:

Listing configuration settings ...

Settings:

WadeBarnes commented 3 years ago

@basilv, Did the trouble you were having here get resolved over here, https://github.com/BCDevOps/backup-container/issues/85?

WadeBarnes commented 3 years ago

@MScallion, have you been able to resolve the issues you were having?

basilv commented 3 years ago

@basilv, Did the trouble you were having here get resolved over here, #85?

No, as per #85 I figured out how to run the backup verification from my cron job, but it still fails as described above. I've manually tested the database restore process and confirmed the backups are working properly, so its just something in the verification script. Not sure if it is a Postgres 13 issue, or a PostGIS issue (e.g. maybe the verification script is skipping initialization needed for PostGIS extension).

WadeBarnes commented 3 years ago

It's failing the start postgres during the verification, which is typically due to the database credentials not being wired up correctly. You could troubleshoot by running the onStartServer scripts manually without the >/dev/null 2>&1 bit so you can see the logs; https://github.com/BCDevOps/backup-container/blob/master/docker/backup.postgres.plugin#L104-L123

basilv commented 3 years ago

Would be nice to have an option to show the logs without having to build a completely different container to remove the >/dev/null 2>&1 bit. Looking at the onStartServer function defined in backup.postgres.plugin, it seems to execute 'run-postgresql', but I don't see that defined anywhere (and would have expected pg_ctl to be used, like it is in onStopServer).

I tried running pg_ctl start/init, here's the results: $ pg_ctl start pg_ctl: directory "/var/lib/postgresql/data" is not a database cluster directory (this is after overriding VERIFICATION_VOLUME_MOUNT_PATH to be this location instead of /var/lib/pgsql/data) $ pg_ctl init initdb: could not look up effective user ID 1007290000: user does not exist (my postgres container runs as user postgres, but the backup container Dockerfile uses user 26, maybe that's the issue?)

Also, I've had a hard time telling from the documentation how to supply the credentials. I'm using Openshift CronJob approach. For the cronjob container, I'm supplying env variables DATABASE_USER and DATABASE_PASSWORD as well as POSTGRESQL_USER and POSTGRESQL_PASSWORD (all values pulled from secret). I'm pretty sure I just need DATABASE_USER & DATABASE_PASSWORD though.

(FYI, I've updated my database to use patroni (with postgres 13 and PostGIS). This hasn't seemed to impact the backup procedure at all - still works fine, so at least that's positive.)

MScallion commented 3 years ago

@MScallion, have you been able to resolve the issues you were having?

No. I would need to update the backup.postgres.plugin file in the BCDevOps/backup-container repo to use pg_dumpall, which I have verified locally, in place of pg_dump -Fc,. I am also not able to pull from my fork of that repo due to security settings. In the short-term I was able to backup and successfully restore the database manually by 1) executing my deployment script to create the database, users and roles then 2) restoring the SQL from the backup. Fortunately I had separated the create database from the create DB objects in my deployment scripts.

I am not able to successfully validate the backup files using the automation provided in the backup-container scripts, due to the grants to the roles, which I have added to the source templates and which are not included in the pg_dump function.

basilv commented 1 year ago

Thanks @joeleonjr , I checked all my prior comments but see no API keys. (I didn't check old edits of comments, no idea how to do that.)

basilv commented 1 year ago

Thanks! @MScallion is on a separate team, so they'll need to respond.