dimitri / pgcopydb

Copy a Postgres database to a target Postgres server (pg_dump | pg_restore on steroids)

Post-migration validations & performance considerations for irregular distributions of table data #147

Closed swilson-dre closed 1 year ago

swilson-dre commented 1 year ago

Hi @dimitri - Some colleagues and I were actively pursuing a similar data migration solution piping pg_dump to pg_restore, and we identified a couple of needs of any pg_dump/pg_restore/COPY-based migration that I'm not sure are addressed by pgcopydb. If they are, apologies for missing it in the docs. If not, maybe you can shed some light or take these as feature requests (since your approach is a bit more robust and feature-rich).

  1. It is necessary to perform post-migration validations that all objects and row-level data were migrated successfully. The only way to call a migration successful is to clearly show that all objects exist in both source and target and that all user data/row counts are consistent between the two. One cannot rely on exit codes alone to definitively say a migration succeeded, though it can be assumed that if the underlying utilities (pg_dump/pg_restore/COPY) are doing their job as advertised, things should be OK with the database objects and table data. We perform high-level validations of these post-migration with side-by-side views so customers have confidence that their production data was fully maintained (a minimal row-count sketch follows after this list).

  2. A weakness of pg_dump/pg_restore is obviously that it is single-threaded per table unless you use COPY; otherwise you will be bottlenecked on your largest tables. We took a similar approach, using a table chunking strategy to distribute the parallel table COPY commands across chunks of table data that are as evenly sized as possible (not so straightforward, since you need to compute data distribution histograms to know how to chunk the data). I was curious whether I missed this in pgcopydb. If it's not there, this feature would be a useful addition, because even running parallel COPYs on tables with unevenly distributed data can become a bottleneck unless each COPY thread is moving approximately the same amount of data.
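
To make the first point concrete, the kind of validation I mean can start with exact per-table row counts gathered the same way on source and target and then diffed. Here is a minimal sketch in plain SQL (the schema filter is just a placeholder, and this is nothing pgcopydb provides today):

```sql
-- Run the same query on source and target, then diff the two outputs.
-- query_to_xml() lets us get exact counts for every user table in one
-- statement; pg_class.reltuples would only give estimates.
SELECT table_schema,
       table_name,
       (xpath('/row/cnt/text()',
              query_to_xml(format('SELECT count(*) AS cnt FROM %I.%I',
                                  table_schema, table_name),
                           false, true, '')))[1]::text::bigint AS row_count
  FROM information_schema.tables
 WHERE table_type = 'BASE TABLE'
   AND table_schema NOT IN ('pg_catalog', 'information_schema')
 ORDER BY table_schema, table_name;
```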

dimitri commented 1 year ago

Hi @swilson-dre; thanks for your interest in pgcopydb.

  1. Validation that the migration was done correctly is usually done with application-specific dashboards rather than generic SQL tidbits such as a row count... that said, skytools/londiste had a per-table sha1 implementation that could be useful here (see https://github.com/pgq/londiste/blob/ce2b7dc8472773dbd6a2f8aa951a0b845443276d/londiste/compare.py#L41). I would review a PR that implements such an operation as long as it's separate from the main commands, i.e. entirely optional (a rough sketch of the idea follows after this list).
  2. Have you read the following part of the docs: https://pgcopydb.readthedocs.io/en/latest/design.html#same-table-concurrency, where same-table concurrency is detailed? And then, what happens when you try the option --split-tables-larger-than 100GB or something?
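
For illustration only, the idea behind that londiste comparison boils down to computing a row count and a content hash per table on both sides; a rough SQL-only sketch (not the exact londiste query, and the table name is a placeholder) would be:

```sql
-- Run on source and target for the same table and compare both columns.
-- hashtext() is an internal, undocumented Postgres function, and the sum
-- is order-independent but can collide, so treat this as a sanity check
-- rather than a proof of equality.
SELECT count(*)                AS row_count,
       sum(hashtext(t::text))  AS rows_checksum
  FROM some_table AS t;        -- placeholder table name
```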
swilson-dre commented 1 year ago

Thanks for the feedback @dimitri and for the second doc link. I had not seen that, but I very much agree that those considerations are key when using a parallelized table chunking COPY. There will always be a bottleneck somewhere. And even with ample network bandwidth and IOPS, the data distribution in the table can still become a problem (and your next bottleneck).

Also, I've found that not every table has primary or unique key indexes. And for those that do, oftentimes they're implemented as composite indexes or on non-integer columns. Saving our criticism of bad DB design and shoddy Postgres implementations for another day, it happens, and sometimes we walk into that scenario and have to help move the data from, say, Azure Single Server to Azure Flexible Server. The point is, we have to deal with it sometimes.

The scenario we encountered and overcame took us from a 28-hour pg_dump/pg_restore migration down to a 4-hour hybrid pg_dump/pg_restore + parallel COPY migration.

The point here is that, after realizing pg_dump/pg_restore alone was too slow, I wanted to take a similar parallel COPY approach with the two large tables involved and let pg_dump/pg_restore do the rest of the work on the other tables, because that part was easy and fast.

The problem was that our big tables had no purely integer primary key on their ID columns, and the data wasn't evenly distributed across the parallel COPY jobs when I simply took the difference between max(id) and min(id) and divided by the number of threads I wanted to run to get my bucket sizes. The same thing could happen even with an integer-based PK indexed column if there were large gaps in its values (e.g. lots of deletes over time, "fragmenting" the table's data distribution).
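
To show what I mean by uneven buckets, here is a sketch of how the skew of that naive equal-width split can be checked (table/column names and the worker count are placeholders, it assumes a numeric id purely for illustration, and it is not something pgcopydb provides):

```sql
-- Naive approach: split the key range (max - min) into N equal-width
-- buckets, one per COPY worker.  The per-bucket row counts show how
-- lopsided the work gets when the data is clustered or has big gaps.
WITH bounds AS (
    SELECT min(id) AS lo, max(id) AS hi
      FROM source.big_table                       -- placeholder names
)
SELECT width_bucket(t.id, b.lo, b.hi + 1, 8) AS bucket,   -- 8 workers
       count(*)                              AS rows_in_bucket,
       min(t.id)                             AS bucket_lower_boundary,
       max(t.id)                             AS bucket_upper_boundary
  FROM source.big_table AS t
 CROSS JOIN bounds AS b
 GROUP BY 1
 ORDER BY 1;
```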

I found that in some cases indexes weren't used at all, making the chunking queries behind the COPY commands slow due to sequential scans rather than index scans, while in other cases the EXPLAIN ANALYZE output for the SELECT * FROM source.table WHERE id BETWEEN <bucket_lower_boundary> AND <bucket_upper_boundary> queries used to chunk and parallelize the COPY commands did show index scans. Creating temporary integer indexes on these tables just for the migration ended up taking hours by itself, even on the smaller 40k table (since full table scans are performed and those massive blobs all had to be read from disk), so this wasn't a viable solution for us (though it could be in other people's cases).
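
For completeness, the per-chunk index check and transfer looked roughly like the following (boundaries and names are placeholders, and the COPY form shown is just one way to ship a chunk, not necessarily what we ran verbatim):

```sql
-- Check whether one chunk's range predicate actually gets an index scan.
-- Note that ANALYZE executes the query, so it reads the whole chunk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
  FROM source.big_table
 WHERE id BETWEEN 1000000 AND 1999999;   -- one bucket's boundaries

-- Shipping that chunk: COPY accepts an arbitrary query, so each worker
-- can stream its own slice (pipe the output into COPY ... FROM STDIN on
-- the target).
COPY (SELECT *
        FROM source.big_table
       WHERE id BETWEEN 1000000 AND 1999999)
  TO STDOUT;
```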

So I ended up resorting to more elaborate tactics, like converting the indexed bytea column into integer values and using those values as lower/upper boundaries in the chunking query. A histogram query (such as demonstrated here) gave me confidence that this was the most effective method for evenly balancing the work each parallel COPY process was doing: by picking the (indexed) column that also gave the most even distribution of rows, it maximized the CPU/IO/bandwidth available and moved data more optimally. For better response, I bumped up those limits temporarily for the migration and tuned the number of buckets to improve timings.
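
And here is a sketch of the bytea-to-integer boundary derivation I'm describing (names are placeholders, it assumes every key is at least 4 bytes long, and the ntile() pass is itself a full scan of the table, so it isn't free):

```sql
-- Map the first 4 bytes of the bytea key to a bigint, then let ntile()
-- split the rows into 8 chunks that each hold roughly the same number
-- of rows.  The min/max per chunk become the lower/upper boundaries
-- used in each worker's BETWEEN predicate.
WITH keyed AS (
    SELECT (get_byte(key, 0)::bigint << 24)
         | (get_byte(key, 1) << 16)
         | (get_byte(key, 2) << 8)
         |  get_byte(key, 3)                        AS key_int
      FROM source.big_table                         -- placeholder names
),
chunked AS (
    SELECT key_int,
           ntile(8) OVER (ORDER BY key_int) AS chunk    -- 8 workers
      FROM keyed
)
SELECT chunk,
       min(key_int) AS bucket_lower_boundary,
       max(key_int) AS bucket_upper_boundary,
       count(*)     AS rows_in_chunk
  FROM chunked
 GROUP BY chunk
 ORDER BY chunk;
```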

Anyway, I just wanted to share that experience to hopefully give you and others some insight that the integer-based primary key/unique key limitations you documented are very real, but they aren't necessarily showstoppers if you take a deeper look at the profile of the data you're moving. It is also worth saying that not all Postgres data migrations are cookie-cutter operations: every migration is unique and worth a little analysis before diving in or committing to any delivery timelines.