dimitri / pgcopydb

Copy a Postgres database to a target Postgres server (pg_dump | pg_restore on steroids)
Other
1.16k stars 76 forks source link

SQLITE database locking causes index and constraint creation to fail, --resume doesn't fix completely #881

Open djpenka opened 4 weeks ago

djpenka commented 4 weeks ago

As part of the workaround for #876, I have been manually performing the pgcopydb steps for database migrations with the following steps:

During the pgcopydb copy indexes step, I ran into an issue where the sqlite database became locked, see logs

19:53:11.299 25596 ERROR  Failed to execute statement: insert or replace into process(  pid, ps_type, ps_title, tableoid, partnum, indexoid)values($1, $2, $3, $4, $5, $6)
19:53:11.299 25596 ERROR  [SQLite 5] database is locked
19:53:11.299 25596 ERROR  Failed to execute SQLite query, see above for details
19:53:11.299 25596 ERROR  Failed to track progress in our catalogs, see above for details
19:53:11.299 25596 ERROR  Failed to create index with oid 93662413, see above for details
19:53:16.502 25594 ERROR  Failed to execute statement: delete from process where pid = $1
19:53:16.502 25594 ERROR  [SQLite 5] database is locked
19:53:16.502 25594 ERROR  Failed to execute SQLite query, see above for details
19:53:16.502 25594 WARN   Failed to delete catalog process entry for pid 25594
19:53:16.502 25594 ERROR  Failed to close "/mnt/data/pgcopydb/schema/source.db":
19:53:16.502 25594 ERROR  [SQLite]: unable to close due to unfinalized statements or unfinished backups
19:53:16.603 25584 ERROR  Sub-process 25594 exited with code 12
19:53:21.683 25596 ERROR  Failed to execute statement: delete from process where pid = $1
19:53:21.683 25596 ERROR  [SQLite 5] database is locked
19:53:21.683 25596 ERROR  Failed to execute SQLite query, see above for details
19:53:21.683 25596 WARN   Failed to delete catalog process entry for pid 25596
19:53:21.683 25596 ERROR  Failed to close "/mnt/data/pgcopydb/schema/source.db":
19:53:21.683 25596 ERROR  [SQLite]: unable to close due to unfinalized statements or unfinished backups
19:53:21.708 25584 ERROR  Sub-process 25596 exited with code 12
19:53:21.908 25584 ERROR  Some INDEX worker process(es) have exited with error, see above for details
19:53:21.909 25584 ERROR  Failed to create indexes, see above for details
19:53:21.916 25574 ERROR  Sub-process 25584 exited with code 12
19:53:21.916 25574 ERROR  Some sub-processes have exited with error status, see above for details

This was when I was mostly done creating about 1000 indexes. I got a stream of errors that new processes could not be created, and two current processes were unable to be deleted from the process table. They still show up in pgcopydb list progress --json output. I tried to resume the index creation with pgcopydb copy indexes --resume which completed successfully, along with pgcopydb copy constraints. However, the next step pgcopydb restore post-data --no-owner --no-acl ran into the following errors

14:41:48.527 456060 ERROR  pg_restore: while PROCESSING TOC:
14:41:48.527 456060 ERROR  pg_restore: from TOC entry 10970; 2606 299027 CONSTRAINT participant participant_ak <EXPUNGED>
14:41:48.527 456060 ERROR  pg_restore: error: could not execute query: ERROR:  relation "participant_ak" already exists
14:41:48.527 456060 ERROR  Command was: ALTER TABLE ONLY public.participant
14:41:48.527 456060 ERROR      ADD CONSTRAINT participant_ak UNIQUE (businessobjectid, authorid);
14:41:48.527 456060 ERROR  pg_restore: from TOC entry 10972; 2606 299061 CONSTRAINT participant participant_pk <EXPUNGED>
14:41:48.527 456060 ERROR  pg_restore: error: could not execute query: ERROR:  relation "participant_pk" already exists
14:41:48.527 456060 ERROR  Command was: ALTER TABLE ONLY public.participant
14:41:48.527 456060 ERROR      ADD CONSTRAINT participant_pk PRIMARY KEY (id);
14:41:48.547 456060 ERROR  pg_restore: while PROCESSING TOC:
14:41:48.547 456060 ERROR  pg_restore: from TOC entry 11662; 2606 57076428 CONSTRAINT public.businessobject_28 public.businessobject_28_businessobjectmodelid_id_key <EXPUNGED>
14:41:48.547 456060 ERROR  pg_restore: error:
14:41:48.547 456060 ERROR  could not execute query: ERROR:  relation "public.businessobject_28_businessobjectmodelid_id_key" already exists
14:41:48.547 456060 ERROR  Command was: ALTER TABLE ONLY public."public.businessobject_28"
14:41:48.547 456060 ERROR      ADD CONSTRAINT "public.businessobject_28_businessobjectmodelid_id_key" UNIQUE (businessobjectmodelid, id);
14:41:48.555 456060 ERROR  pg_restore: while PROCESSING TOC:
14:41:48.555 456060 ERROR  pg_restore: from TOC entry 11659; 2606 57076430 CONSTRAINT public.businessobject_28 public.businessobject_28_authorid_businessobjecttag_id_key <EXPUNGED>
14:41:48.556 456060 ERROR  pg_restore:
14:41:48.556 456060 ERROR  error: could not execute query: ERROR:  relation "public.businessobject_28_authorid_businessobjecttag_id_key" already exists
14:41:48.556 456060 ERROR  Command was: ALTER TABLE ONLY public."public.businessobject_28"
14:41:48.556 456060 ERROR      ADD CONSTRAINT "public.businessobject_28_authorid_businessobjecttag_id_key" UNIQUE (authorid, businessobjecttag, id);
14:41:48.570 456060 ERROR  pg_restore: while PROCESSING TOC:
14:41:48.570 456060 ERROR  pg_restore: from TOC entry 11665; 2606 57076426 CONSTRAINT public.businessobject_28 public.businessobject_28_pkey <EXPUNGED>
14:41:48.570 456060 ERROR  pg_restore: error: could not execute query: ERROR:  multiple primary keys for table "public.businessobject_28" are not allowed
14:41:48.570 456060 ERROR  Command was: ALTER TABLE ONLY public."public.businessobject_28"
14:41:48.570 456060 ERROR      ADD CONSTRAINT "public.businessobject_28_pkey" PRIMARY KEY (id);
14:41:49.904 456060 ERROR  pg_restore: from TOC entry 12829; 0 0 INDEX ATTACH public.businessobject_28_authorid_businessobjecttag_id_key <EXPUNGED>
14:41:49.904 456060 ERROR  pg_restore: error: could not execute query: ERROR:  cannot attach index "public.businessobject_28_authorid_businessobjecttag_id_key" as a partition of index "businessobject_ak1"
14:41:49.904 456060 ERROR  DETAIL:  The index "businessobject_ak1" belongs to a constraint in table "businessobject_partition" but no constraint exists for index "public.businessobject_28_authorid_businessobjecttag_id_key".
14:41:49.904 456060 ERROR  Command was: ALTER INDEX public.businessobject_ak1 ATTACH PARTITION public."public.businessobject_28_authorid_businessobjecttag_id_key";
14:41:49.918 456060 ERROR  pg_restore: from TOC entry 12831; 0 0 INDEX ATTACH public.businessobject_28_businessobjectmodelid_id_key <EXPUNGED>
14:41:49.918 456060 ERROR  pg_restore: error: could not execute query: ERROR:  cannot attach index "public.businessobject_28_businessobjectmodelid_id_key" as a partition of index "businessobject_ak"
14:41:49.918 456060 ERROR  DETAIL:  The index "businessobject_ak" belongs to a constraint in table "businessobject_partition" but no constraint exists for index "public.businessobject_28_businessobjectmodelid_id_key".
14:41:49.918 456060 ERROR  Command was: ALTER INDEX public.businessobject_ak ATTACH PARTITION public."public.businessobject_28_businessobjectmodelid_id_key";

Additionally, output from pgcopydb compare schema gives the following

14:58:09.684 465351 INFO   Running pgcopydb version 0.17-1.pgdg22.04+1 from "/usr/bin/pgcopydb"
14:58:09.690 465351 INFO   Using work dir "/mnt/data/pgcopydb"
14:58:09.691 465351 INFO   SOURCE: Connecting to "<SOURCE_DB>"
14:59:33.465 465351 INFO   Fetched information for 445 tables (including 0 tables split in 0 partitions total), with an estimated total of 1200 million tuples and 255 GB on-disk
14:59:33.569 465351 INFO   Fetched information for 1032 indexes (supporting 652 constraints)
14:59:33.578 465351 INFO   Fetching information for 67 sequences
14:59:34.281 465351 INFO   TARGET: Connecting to "<DESTINATION_DB>"
14:59:34.825 465351 INFO   Fetched information for 445 tables (including 0 tables split in 0 partitions total), with an estimated total of 1245 million tuples and 236 GB on-disk
14:59:34.890 465351 INFO   Fetched information for 1032 indexes (supporting 649 constraints)
14:59:34.893 465351 INFO   Fetching information for 67 sequences
14:59:34.982 465351 INFO   [SOURCE] table: 445, index: 1032, constraint: 652, sequence: 67
14:59:34.982 465351 INFO   [TARGET] table: 445, index: 1032, constraint: 649, sequence: 67
14:59:35.062 465351 ERROR  Table public.participant_pk index public.participant_pk is primary on source and not primary on target
14:59:35.062 465351 ERROR  Table public.participant_pk index public.participant_pk is supporting  constraint named participant_pk on source and  on target
14:59:35.062 465351 ERROR  Table public.participant_pk index public.participant_pk constraint participant_pk definition mismatch.
14:59:35.062 465351 INFO   Source index public.participant_pk constraint participant_pk: PRIMARY KEY (id)
14:59:35.062 465351 INFO   Target index public.participant_pk constraint : (null)
14:59:35.084 465351 ERROR  Table public."public.businessobject_28_businessobjectmodelid_id_key" index public."public.businessobject_28_businessobjectmodelid_id_key" is supporting  constraint named "public.businessobject_28_businessobjectmodelid_id_key" on source and  on target
14:59:35.085 465351 ERROR  Table public."public.businessobject_28_businessobjectmodelid_id_key" index public."public.businessobject_28_businessobjectmodelid_id_key" constraint "public.businessobject_28_businessobjectmodelid_id_key" definition mismatch.
14:59:35.085 465351 INFO   Source index public."public.businessobject_28_businessobjectmodelid_id_key" constraint "public.businessobject_28_businessobjectmodelid_id_key": UNIQUE (businessobjectmodelid, id)
14:59:35.085 465351 INFO   Target index public."public.businessobject_28_businessobjectmodelid_id_key" constraint : (null)
14:59:35.085 465351 ERROR  Table public."public.businessobject_28_authorid_businessobjecttag_id_key" index public."public.businessobject_28_authorid_businessobjecttag_id_key" is supporting  constraint named "public.businessobject_28_authorid_businessobjecttag_id_key" on source and  on target
14:59:35.085 465351 ERROR  Table public."public.businessobject_28_authorid_businessobjecttag_id_key" index public."public.businessobject_28_authorid_businessobjecttag_id_key" constraint "public.businessobject_28_authorid_businessobjecttag_id_key" definition mismatch.
14:59:35.085 465351 INFO   Source index public."public.businessobject_28_authorid_businessobjecttag_id_key" constraint "public.businessobject_28_authorid_businessobjecttag_id_key": UNIQUE (authorid, businessobjecttag, id)
14:59:35.085 465351 INFO   Target index public."public.businessobject_28_authorid_businessobjecttag_id_key" constraint : (null)

My initial idea is to manually modify the destination table constraints to match the source, based on the output.

  1. Will that plan work?
  2. Is there a better way to resume this process when locking happens so that everything is created successfully?
  3. What is the best way to fix the state now?
djpenka commented 4 weeks ago

I manually performed the ALTER TABLE ... ADD CONSTRAINT ... USING INDEX steps and now the pgcopydb compare schema commands show no differences

djpenka commented 3 weeks ago

I was able to see this happening live, and can confirm that there is a segmentation fault that occurs with the pgcopydb list progress --json command while the database locking issues are occurring in the pgcopydb copy indexes --index-jobs process