CrunchyData / pgCompare

pgCompare – a straightforward utility crafted to simplify the data comparison process, providing a robust solution for comparing data across various database platforms.
Apache License 2.0
89 stars 12 forks source link
compare data migration oracle postgres

pgCompare

Data Compare


License

Data Compare Made Simple

pgCompare is a Java application designed for use in situations where you are replicating data from different sources and need to validate that post-replication data consistency, including:

At a higher level, pgCompare reads a row from a table and generates two hash values. The initial hash is executed on the primary key column(s), while the second hash is computed on the remaining columns. The hash can be calculated by the database or by the application. These hash values are stored in the pgCompare repository. Representing the original values as a hash minimizes the required space in the repository database and decreases network traffic. A parallel process consistently conducts comparisons on sets of data as they are loaded to expedite the comparison process and avoid row-by-row processing. Ultimately, a summary of the comparison results is displayed and stored in the pgCompare repository.

pgCompare is an open-source project maintained by the team at Crunchy Data and made available under the Apache 2.0 licenses for broader use, testing, and feedback.

Installation

Requirements

Before initiating the build and installation process, ensure the following prerequisites are met:

  1. Java version 21 or higher.
  2. Maven 3.9 or higher.
  3. Postgres version 15 or higher (to use for the pgCompare Data Compare repository).
  4. Necessary JDBC drivers (DB2, Postgres, MySQL, MSSQL and Oracle currently supported).
  5. Postgres connections must not go through a connection pooler like pgBouncer.

Limitations

The following are current limitations of the compare utility:

  1. Date/Timestamp only compared with a precision of second (DDMMYYYYHH24MISS).
  2. Unsupported data types: blob, long, longraw, bytea.
  3. Limitations with data type boolean when performing cross-platform compare.

Upgrading

Version 0.3.0

Enhancements / Fixes

Upgrading to 0.3.0

Due to the changes required for the repository, the repository must be dropped and recreated to upgrade to version 0.3.0.

Compile

Once the prerequisites are met, begin by forking the repository and cloning it to your host machine:

YOUR_GITHUB_UN="<your GitHub username>"
git clone --depth 1 "git@github.com:${YOUR_GITHUB_UN}/pgCompare.git"
cd pgCompare

Compile the Java source:

mvn clean install

Create pgcompare.properties

Copy the pgcompare.properties.sample file to pgcompare.properties and define the repository, source, and target connection parameters. Refer to the Properties section for more details on the settings.

By default, the application looks for the properties file in the execution directory. Use the PGCOMPARE_CONFIG environment variable override the default and point to a file in a different location.

At a minimal the repo-xxxxx parameters are required in the properties file (or specified by environment parameters). Besides the properties file and environment variables, another alternative is to store the property settings in the dc_project table. Settings can be stored in the project_config column in JSON format ({"parameter": "value"}).

Configure Repository Database

pgCompare necessitates a hosted Postgres repository. To configure, connect to a Postgres database and execute the provided pgCompare.sql script in the database directory. The repository may also be created using the --init flag.

java -jar pgcompare.jar --init

Getting Started

Deploy Sample Schema

In the database directory are two scripts to deploy a sample table (EMP) to Oracle (sample_ora.sql) and Postgres (sample_pg.sql). Deploy the table and data to source and target as approriate.

Defining Table Mapping

The initial step involves defining a set of tables to compare, achieved by inserting rows into the dc_table and dc_table_map tables in the pgCompare repository. This is best done using the automated process below.

Automated Table Registry

Use pgCompare to perform a discovery against the target database and populate the dc_table with the results using the following command. The schemas specified in the properties file will be used for the discovery process.

java -jar pgcompare.jar --discover

Manual Table Registry

Example of loading a row into dc_table and dc_table_map:

INSERT INTO dc_table (table_alias)
  VALUES ('emp');

INSERT INTO dc_table_map (tid, dest_type, schema_name, table_name)
  VALUES (1, 'source', 'hr', 'emp');

INSERT INTO dc_table_map (tid, dest_type, schema_name, table_name)
  VALUES (1, 'target', 'HR', 'EMP');

After populating the list of tables, run the following to automatically map columns.

java -jar pgcompare.jar --batch=0 --maponly

Projects

Projects allow for the repository to maintain different mappings for different compare objectives. This allows a central pgCompare repository to be used for multiple compare projects. Each table has a pid column which is the project id. If no project is specified, the default project (pid = 1) is used.

Perform Data Compare

With the table mapping defined, execute the comparison and provide the mandatory batch command line argument:

java -jar pgcompare.jar --batch 0

Using a batch value of 0 will execute the action for all batches. The batch number may also be specified using the environment variable PGCOMPARE-BATCH. The default value for batch number is 0 (all batches).

Debug/Recheck Out-of-Sync Rows

If discrepancies are detected, run the comparison with the 'check' option:

java -jar pgcompare.jar --batch 0 --check

This recheck process is useful when transactions may be in flight during the initial comparison. The recheck only checks the rows that have been flagged with a discrepancy. If the rows still do not match, details will be reported. Otherwise, the rows will be cleared and marked in-sync.

Reference

Column Map

The system will automatically generate a column mapping during the first execution on a table. This column mapping will be stored in the dc_table_column and dc_table_column_map repository tables. This mapping can be performed ahead of time or the generated mapping modified as needed. If a column mapping is present, the program will not perform a remap unless instructed to using the maponly flag.

To create or overwrite current column mappings stored in column_map colum of dc_table, execute the following:

java -jar pgcompare.jar --batch 0 --maponly

Properties

Properties are categorized into four sections: system, repository, source, and target. Each section has specific properties, as described in detail in the documentation. The properties can be specified via a configuration file, environment variables or a combination of both. To use environment variables, the environment variable will be the name of hte property in upper case prefixed with "PGCOMPARE-". For example, batch-fetch-size can be set by using the environment variable PGCOMPARE-BATCH-FETCH-SIZE.

System

Repository

Source

Target

Property Precedence

The system contains default values for every parameter. These can be over-ridden using environment variables, properties file, or values saved in the dc_project table. The following is the order of precedence used:

Data Compare Concepts

pgCompare stores a hash representation of primary key columns and other table columns, reducing row size and storage demands. The utility optimizes network traffic and speeds up the process by using hash functions when comparing similar platforms.

Hash Options

By default, the data is normalized and the hash performed by the Java code. For supported platforms (Oracle 12.1 or higher and Postgres 11 or higher), the program will leverage database functions to perform the hash. Allowing the database to perform the hash does increase CPU load on the hosting server, but reduces network traffic, increases speed, and reduces memory requirements of pgCompare.

Processes

Each comparison involves at least three threads: one for the observer and two for the source and target loader processes. By specifying a mod_column in the dc_tables and increasing parallel_degree, the number of threads can be increased to speed up comparison. Tuning between batch sizes, commit rates, and parallel degree is essential for optimal performance.

Viewing Results

A summary is printed at the end of each run. To view results at a later time, the following SQL may be used in the repository database.

Results from Last Run

WITH mr AS (SELECT max(rid) rid FROM dc_result)
SELECT compare_start, table_name, status, source_cnt total_cnt, equal_cnt, not_equal_cnt, missing_source_cnt+missing_target_cnt missing_cnt
FROM dc_result r
     JOIN mr ON (mr.rid=r.rid)
ORDER BY table_name;

Out of Sync Rows

SELECT coalesce(s.table_name,t.table_name) table_name,
       coalesce(s.batch_nbr, t.batch_nbr) batch_nbr,
       coalesce(s.thread_nbr,t.thread_nbr) thread_nbr,
       CASE WHEN s.compare_result='n' THEN 'out-of-sync'
            WHEN s.compare_result='m' THEN 'missing target'
            WHEN t.compare_result='m' THEN 'missing source'
            ELSE 'unknown'
       END compare_result,
       coalesce(s.pk,t.pk) primary_key       
FROM dc_source s
     CROSS JOIN dc_target t;

Design Principles

The following list outlines some principles that drives the design of this solution:

Performance

Tuning

There are several parameters that can be used to tune pgCompare. The most common parameters to adjust are batch-fetch-size, batch-commit-size, loader-threads, message-queue-size, observer-throttle, and observer-throttle-size.

Batch Sizes

The batch-fetch-size and batch-commit-size control how many rows are fetched from the target/source for each fetch request. Fetching larger number of rows at one time reduces calls to the database but increases memory requirements for pgCompare. The default values of 2000 for each of these is normally the best setting.

Parallel Settings

The loader-threads parameter controls how many child loader threads to start for each fetching thread. Setting loader-threads to 0 disables this async loading of rows to the repository. For smaller table runs, set this value to 0 as this will be faster since there is no handoffs between the thread fetching the data and these loader threads. Testing has shown best results for larger tables with a setting of 4. Do not set the value over 8 as this will increase memory demand significantly. Use the message-queue-size parameter to control the 'buffer' used by the fetching threads to pass data to the loader threads. Default value for message-queue-size is 100. Increasing message-queue-size will increase the memory demand of pgCompare.

Observer Throttle

pgCompare creates temporary, unlogged tables to use for the actual compare operation. Keeping these tables small helps the compare process perform efficiently. To ensure that the fetching threads do not overrun the Observer, the fetch/loader threads will wait periodically to be cleared by the Observer. This is known as Observer throttling and is enabled or disabled by setting observer-throttle to true or false. If Observer throttling is enabled, the fetch/loader threads will pause every 'n' of rows as specified using observer-throttle-size.

Repository Database

The repository database will have a measurable amount of load during the compare process. To ensure optimal performance, consider the following recommendations:

Compute Resources for Compare

pgCompare requires the execution host to allocate 3 - 12 threads per degree of parallelism (can be higher based on loader-threads setting), with each thread utilizing approximately 400 MB of memory.

pgCompare project source code is available subject to the Apache 2.0 license.