DataExpert-io / data-engineer-handbook

This is a repo with links to everything you'd ever want to learn about data engineering
20.25k stars 3k forks source link

[1-dimensional-data-modeling] pg_restore not idempotent? #133

Open gitgithan opened 3 days ago

gitgithan commented 3 days ago

I noticed running pg_restore -U user -d postgres data.dump repeatedly causes some tables to grow in size. I expected tables to be the same no matter how many restores are done.

This would be confusing users later if they don't know this can happen. Users may restore multiple times when retrying failed attempts.

The 2 images show 1 restore apart. Top section is local, bottom section is docker. imageimage

I'm guessing duplicated data is being inserted after multiple restores.

Is docker non-idempotent behaviour explained at Initialization scripts section of https://hub.docker.com/_/postgres?

Warning: scripts in /docker-entrypoint-initdb.d are only run if you start the container with a data directory that is empty; any pre-existing database will be left untouched on container startup. One common problem is that if one of your /docker-entrypoint-initdb.d scripts fails (which will cause the entrypoint script to exit) and your orchestrator restarts the container with the already initialized data directory, it will not continue on with your scripts.

gitgithan commented 3 days ago

I propose using pg_restore -c --if-exists -U user -d postgres data.dump

--if-exists caters to the 1st run of pg_restore

From man,

   -c
   --clean
       Before restoring database objects, issue commands to DROP all the objects that will be restored. This option is
       useful for overwriting an existing database. If any of the objects do not exist in the destination database,
       ignorable error messages will be reported, unless --if-exists is also specified.

From right to left, 3 pg_restore without -c --if-exists, followed by 1 pg_restore with. To show that the sizes go back to the original size instead of duplicating data.

image

This non-idempotent issue has caused errors that made me second guess whether the restore succeeded. With a -c during pg_restore, these errors should not appear and give users more confidence the restore works.

--------TRUNCATED--------
pg_restore: from TOC entry 3227; 2606 16435 FK CONSTRAINT game_details game_details_game_id_fkey postgres
pg_restore: error: could not execute query: ERROR:  constraint "game_details_game_id_fkey" for relation "game_details" already exists
Command was: ALTER TABLE ONLY public.game_details
    ADD CONSTRAINT game_details_game_id_fkey FOREIGN KEY (game_id) REFERENCES public.games(game_id);

pg_restore: warning: errors ignored on restore: 16

This idea is more relevant to local setups of postgres. For docker which uses initialization scripts, the 1st post in this issue cites that docker only runs initialization scripts once, so adding -c --if-exists in /docker-entrypoint-initdb.d/init-db.sh seems pointless. However these 2 options are still useful if users exec into container to restore again if depending on /docker-entrypoint-initdb.d/init-db.sh fails


Extra investigations

I checked output_file.sql from pg_restore -U user -f output_file.sql data.dump and was surprised to see it drops all constraints and tables, so I expect the restore to be idempotent even without -c or --if-exists. My guess is even though the commands are laid out here, what actually gets run still gets affected by flags used or missing during pg_restore, so I bet these 4 + 9 statements will not run unless -c and --if-exists is used during restore

ALTER TABLE IF EXISTS ONLY public.game_details DROP CONSTRAINT IF EXISTS game_details_game_id_fkey;
ALTER TABLE IF EXISTS ONLY public.player_seasons DROP CONSTRAINT IF EXISTS player_seasons_pkey;
ALTER TABLE IF EXISTS ONLY public.games DROP CONSTRAINT IF EXISTS games_pkey;
ALTER TABLE IF EXISTS ONLY public.actor_films DROP CONSTRAINT IF EXISTS actor_films_pkey;
DROP TABLE IF EXISTS public.users;
DROP TABLE IF EXISTS public.teams;
DROP TABLE IF EXISTS public.player_seasons;
DROP TABLE IF EXISTS public.games;
DROP TABLE IF EXISTS public.game_details;
DROP TABLE IF EXISTS public.events;
DROP TABLE IF EXISTS public.devices;
DROP TABLE IF EXISTS public.arena;
DROP TABLE IF EXISTS public.actor_films;