Migration stalls and never finishes

haubourg commented 1 year ago

Hi again, still testing a migration here with pgcopydb v0.13-1 (from pgdg repo) on ubuntu.

All my attempts never ended, pgcopydb never exits properly. Nothings happens on the processes, and nothing appears in pg_stata_activity.

When killing pgcopydb, the logs suddenly gets filled with old messages indicating i/o error. (sorry, I lost the logs. I'll launch another run to have them)

Doing a pgcopydb compare schema afterwards show ~ 10 missing indices.

Running a pgcopydb restore --post-data raises stale pids :

rhaubourg@argos ~/G/b/a/i/pg_restauration15 (pg15_restore_docker_destination_argos ☡)> pgcopydb copy indexes --index-jobs 12  --resume --not-consistent
09:27:46 3278075 INFO   Running pgcopydb version 0.13-1.pgdg22.04+1 from "/usr/bin/pgcopydb"
09:27:46 3278075 INFO   [SOURCE] Copying database from "postgres://postgres@127.0.0.1:6666/gorenove_fabric"
09:27:46 3278075 INFO   [TARGET] Copying database into "postgres://postgres@127.0.0.1:7777/gorenove_fabric"
09:27:46 3278075 INFO   Schema dump for pre-data and post-data section have been done
09:27:46 3278075 INFO   Pre-data schema has been restored on the target instance
09:27:46 3278075 INFO   All the indexes have been copied to the target instance
09:27:46 3278075 INFO   All the sequences have been copied to the target instance
09:27:46 3278075 INFO   All the large objects have been copied to the target instance
09:27:52 3278075 INFO   Fetched information for 42967 indexes
09:27:52 3278075 INFO   Creating 42967 indexes in the target database using 12 processes
09:28:04 3278102 WARN   Found stale pid 3277392 in file "/tmp/pgcopydb/run/36642348", removing it and creating index spx_1m_partition_49_departement_idx
09:28:04 3278103 WARN   Found stale pid 3277393 in file "/tmp/pgcopydb/run/36642347", removing 

[..snip a dozen of similar messages.. ]

09:28:11 3278110 WARN   Found stale pid 3277399 in file "/tmp/pgcopydb/run/36603177", removing it and creating index ref_admin_commune_pkey
09:28:11 3278106 WARN   Found stale pid 3277395 in file "/tmp/pgcopydb/run/36603179", removing it and creating index sidx_ref_admin_commune_geom

I'm migrating from PG13 to PG15. Target PG15 is aggressively tuned for writing speed with autovaccuum off, fsync off, but I have the same behavior for all my configuration trials.

In journactl, I found a few of these warnings (xxx replaces real values)=:

db=xxxx usr=xxxx, client=172.xxxxx , app=pgcopydb, line=1LOG:  using stale statistics instead of current ones because stats collector is not responding
db=xxxx, usr=xxxx, client=172.xxxxx , app=pgcopydb, line=2STATEMENT:  VACUUM ANALYZE "bdxxxxx"."rel_batiment_groupe_xxxxx"

Any idea of what could be wrong in my setup ?

dimitri commented 1 year ago

When killing pgcopydb, the logs suddenly gets filled with old messages indicating i/o error. (sorry, I lost the logs. I'll launch another run to have them)

Unfortunately from the rest of the message it looks like we need these logs to better understand what's happening. The only other idea I would have at the moment is an overloaded system. Can you check for CPU/RAM/Disk saturation from either the source database system, the pgcopydb environment, or the target database system, while the migration is happening? Maybe also lower your jobs parameters in pgcopydb to adjust?

haubourg commented 1 year ago

Allright, I will launch another migration test and will be back within 24h with real log. Iterations are a bit long :)

As for the OS saturation, below are some grafana graphs of my successive attempts. My config is a test server with 16 cpu, 32 Go of RAM, and HDD not much else happening except the cadvisor service for grafana itself.

I hit a lot of iowait as soon a index creations starts. But you might be right, I hit 80% write speed on my disk , but I see 100% load between migration attempts

I'm suspicious against my soft RAID1 behavior but it might just be that I ask too much concurrency. My first trials were failing some table copy with shared_memory error. I first raised a lot max_connections and max_locks_per_transaction with no success. In the end lowering --table-jobs avoided these issues, but I might be too high . I now use --table-jobs 5 --index-jobs 16. Should I lower those parameters?

The real migration will happen on a 48 CPU / 180 Go RAM with 1To Nvme disks holding the database and ~ and a tablespaces on slow HDD (4 To).

I am not sure what parallelism settings I should use during the real migration. Any rule of thumb would be welcome.

dimitri commented 1 year ago

Rule of thumb for getting started:

--table-jobs is how many concurrent processes your SOURCE system can deal with, mostly IO bound (disk reads), sometime CPU bound (converting on-disk binary data to text format may use more CPU than anticipated depending on the data type)
--index-jobs is how many CPU you can dedicate on the TARGET system to CREATE INDEXes, but if you're limited by IOPS bandwidth then this needs taken into account too.

haubourg commented 1 year ago

Si I ran again the same migration and it seems related to #496

The stdout is stuck on this error (at 20:32 server time) . All the workers stay idle doing nothing. I SIGTERM the workers, and then they spit out their logs into the log file .

023-10-16 20:32:43 3822248 ERROR pgsql.c:2487 [TARGET 66111] [42P10] ERROR:  column "test" is a generated column
2023-10-16 20:32:43 3822248 ERROR pgsql.c:2495 [TARGET 66111] DETAIL:  Generated columns cannot be used in COPY.
2023-10-16 20:32:43 3822248 ERROR pgsql.c:2502 [TARGET 66111] Context: copy "z_tmp_rhaubourg"."ref_admin_commune"("id", "geom", "fid", "nom", "nom_m", "insee_com", "statut", "population", "insee_canton", "insee_arrond", "insee_dep", "insee_reg", "siren_epci", "nom_generated", "test") from stdin
2023-10-16 20:32:43 3822248 ERROR table-data.c:450 Failed to copy data for table with oid 36586396 and part number 0, see above for details

[ here the log was stuck before sending SIGTERM]

upe_ffo_bat_batiment_groupe_id_pk"
2023-10-16 19:16:47 3822258 NOTICE indexes.c:772 CREATE UNIQUE INDEX batiment_groupe_rnc_batiment_groupe_id_pk ON tables_bdnb_v1_2022_10d_2.batiment_groupe_rnc USING btree (batiment_groupe_id);
2023-10-16 19:16:47 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/36614719.done"
2023-10-16 19:16:47 3822258 NOTICE indexes.c:772 CREATE INDEX "8d46f0be081184f13abfb05509ffc811" ON tables_bdnb_v1_2022_10d_2.batiment_groupe_simulations_dvf USING btree (batiment_groupe_id);
2023-10-16 19:16:47 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/36614696.done"
2023-10-16 19:16:49 3822258 NOTICE indexes.c:772 CREATE INDEX ix_z_tmp_cbacon_superposition_tup_index ON z_tmp_cbacon.superposition_tup USING btree (index);
2023-10-16 19:16:49 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/36589845.done"
2023-10-16 19:16:49 3822258 NOTICE indexes.c:1156 Found 1 indexes on target database for table "z_tmp_cbacon"."superposition_tup"
2023-10-16 19:16:49 3822258 NOTICE indexes.c:772 CREATE INDEX idx_geom_groupe ON z_tmp_cbacon.test_dvf_open USING gist (geom_groupe);
2023-10-16 19:16:49 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/36593662.done"
2023-10-16 19:16:50 3822258 NOTICE indexes.c:772 CREATE INDEX recette_pies_v07_dev_geom_pies_idx ON z_tmp_mhamdoune.recette_pies_v07_dev USING gist (geom_pies);
2023-10-16 19:16:50 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/36604040.done"
2023-10-16 19:16:50 3822258 NOTICE indexes.c:1156 Found 2 indexes on target database for table "z_tmp_mhamdoune"."recette_pies_v07_dev"
2023-10-16 19:16:50 3822258 NOTICE indexes.c:772 CREATE INDEX differentiel_iris_v07_v08_code_iris_idx ON z_tmp_mthorel.differentiel_iris_v07_v08 USING btree (code_iris);
2023-10-16 19:16:50 3822258 NOTICE indexes.c:944 Creating summary file "/tmp/pgcopydb/run/indexes/3
[...]
2023-10-17 13:06:41 3822263 ERROR indexes.c:98 CREATE INDEX worker has been interrupted

I can workaround by removing generated columns from source by now and will test again

dimitri commented 9 months ago

Meanwhile support for generated columns have been added to pgcopydb in https://github.com/dimitri/pgcopydb/pull/517, and the PR is part of release https://github.com/dimitri/pgcopydb/releases/tag/v0.14. Closing this issue. Please consider opening again with more information if needed, or opening a new issue. Also please try with current “main” branch before opening an issue, the --fail-fast option keeps getting improved to handle many situations where pgcopydb would just stall otherwise (error handling can be difficult to implement right sometimes).

dimitri / pgcopydb

Migration stalls and never finishes #499