IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0
27 stars 34 forks source link

CSM DB backup_script update related to number of working threads whil… #1010

Closed williammorrison2 closed 2 years ago

williammorrison2 commented 2 years ago

Purpose

This added feature using the (-j,--jobs = njobs) flag will help minimize the overall backup processes especially on larger DBs. This was a feature that was introduced in PostgreSQL 9.3 and onward. Here is a section of the documentation from the PostgreSQl site:

pg_dump

pg_dump — extract a PostgreSQL database into a script file or other archive file
..........
Options
-j njobs
--jobs=njobs

Run the dump in parallel by dumping njobs tables simultaneously. This option reduces the time of the dump but it also increases the load on the database server. You can only use this option with the directory output format because this is the only output format where multiple processes can write their data at the same time.
pg_dump will open njobs + 1 connections to the database, so make sure your max_connections setting is high enough to accommodate all connections.
Requesting exclusive locks on database objects while running a parallel dump could cause the dump to fail. The reason is that the pg_dump master process requests shared locks on the objects that the worker processes are going to dump later in order to make sure that nobody deletes them and makes them go away while the dump is running. If another client then requests an exclusive lock on a table, that lock will not be granted but will be queued waiting for the shared lock of the master process to be released. Consequently any other access to the table will not be granted either and will queue after the exclusive lock request. This includes the worker process trying to dump the table. Without any precautions this would be a classic deadlock situation. To detect this conflict, the pg_dump worker process requests another shared lock using the NOWAIT option. If the worker process is not granted this shared lock, somebody else must have requested an exclusive lock in the meantime and there is no way to continue with the dump, so pg_dump has no choice but to abort the dump.
For a consistent backup, the database server needs to support synchronized snapshots, a feature that was introduced in PostgreSQL 9.2 for primary servers and 10 for standbys. With this feature, database clients can ensure they see the same data set even though they use different connections. pg_dump -j uses multiple database connections; it connects to the database once with the master process and once again for each worker job. Without the synchronized snapshot feature, the different worker jobs wouldn't be guaranteed to see the same data in each connection, which could lead to an inconsistent backup.
If you want to run a parallel dump of a pre-9.2 server, you need to make sure that the database content doesn't change from between the time the master connects to the database until the last worker job has connected to the database. The easiest way to do this is to halt any data modifying processes (DDL and DML) accessing the database before starting the backup. You also need to specify the --no-synchronized-snapshots parameter when running pg_dump -j against a pre-9.2 PostgreSQL server.

Note: The script will backup each of tables as a separate compressed gz file. Here is a sample output of the directory and files.

# ls -l /var/lib/pgsql/backups/csmdb_19.0_20-09-2021_17_21_21
total 704
-rw-r--r-- 1 root root    121 Sep 20 17:21 2703.dat.gz
-rw-r--r-- 1 root root    184 Sep 20 17:21 2705.dat.gz
-rw-r--r-- 1 root root     97 Sep 20 17:21 2706.dat.gz
-rw-r--r-- 1 root root    141 Sep 20 17:21 2707.dat.gz
-rw-r--r-- 1 root root    185 Sep 20 17:21 2708.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2709.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2710.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2712.dat.gz
-rw-r--r-- 1 root root     73 Sep 20 17:21 2713.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2714.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2715.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2716.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2717.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2718.dat.gz
-rw-r--r-- 1 root root    918 Sep 20 17:21 2719.dat.gz
-rw-r--r-- 1 root root   1338 Sep 20 17:21 2720.dat.gz
-rw-r--r-- 1 root root   1371 Sep 20 17:21 2721.dat.gz
-rw-r--r-- 1 root root   1533 Sep 20 17:21 2722.dat.gz
-rw-r--r-- 1 root root    254 Sep 20 17:21 2723.dat.gz
-rw-r--r-- 1 root root    328 Sep 20 17:21 2724.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2725.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2726.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2727.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2728.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2729.dat.gz
-rw-r--r-- 1 root root    550 Sep 20 17:21 2730.dat.gz
-rw-r--r-- 1 root root    631 Sep 20 17:21 2731.dat.gz
-rw-r--r-- 1 root root    190 Sep 20 17:21 2732.dat.gz
-rw-r--r-- 1 root root    220 Sep 20 17:21 2733.dat.gz
-rw-r--r-- 1 root root    320 Sep 20 17:21 2734.dat.gz
-rw-r--r-- 1 root root    494 Sep 20 17:21 2735.dat.gz
-rw-r--r-- 1 root root  15901 Sep 20 17:21 2737.dat.gz
-rw-r--r-- 1 root root  12736 Sep 20 17:21 2738.dat.gz
-rw-r--r-- 1 root root    251 Sep 20 17:21 2740.dat.gz
-rw-r--r-- 1 root root    294 Sep 20 17:21 2741.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2742.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2743.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2744.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2745.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2746.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2747.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2748.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2749.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2750.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2751.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2752.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2753.dat.gz
-rw-r--r-- 1 root root     25 Sep 20 17:21 2754.dat.gz
-rw-r--r-- 1 root root 498225 Sep 20 17:21 toc.dat

How to Test

We will set a baseline (default value) which can be configured according to the specific environment.

  1. Create various DB sizes and determine the best default worker threads.

Screenshots

I will add.

Open Questions and Pre-Merge TODOs

thanh-lam commented 2 years ago

Note: We need to test these changes with the version of PostgreSQL on RHEL 8.4, which is the OS for the next release.