Closed KludgeKML closed 2 years ago
It looks like this might actually be a general Postgres problem, rather than anything to do with the Publishing API specifically.
I've just tried importing data for the app service-manual-publisher
, which has a much smaller database at only 12 MB (gzipped). That exhibited the same problem.
We've recently upgraded from Postgres 9.6 to Postgres 13. I wonder if the format of the database dumps have changed. Previously isn't wasn't normal for the CLI to print the database dump to the screen as it seems to do now.
So, are we sure these files are being gzipped correctly? If I try to look at the dump I've got from the replication script:
keith.lawrence@GDS11342 postgresql % pwd
/Users/keith.lawrence/govuk/govuk-docker/replication/postgresql
keith.lawrence@GDS11342 postgresql % gunzip publishing_api_production.dump.gz
gunzip: publishing_api_production.dump.gz: not in gzip format
and if I look in the file:
keith.lawrence@GDS11342 postgresql % head -n10 publishing_api_production.dump.gz
PGDMzpublishing_api_production13.313.3 (Debian 13.3-1.pgdg100+1)??0ENCODINENCODINGSET client_encoding = 'UTF8';
false?00
STDSTRINGS
STDSTRINGS(SET standard_conforming_strings = 'on';
false?00
SEARCHPATH
SEARCHPATH8SELECT pg_catalog.set_config('search_path', '', false);
false?1262 217347371publishing_api_productioDATABASEnCREATE DATABASE publishing_api_production WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE = 'en_US.UTF-8';
)DROP DATABASE publishing_api_production;
aws_db_adminfalse?00"DATABASE publishing_api_productionACLCGRANT ALL ON DATABASE publishing_api_production TO publishing_api;
..which doesn't look like a gz archive.
Is it possible the files are not being compressed in the new scripts (or the compress is failing quietly), and that's causing the import to script to fail?
It strongly hints that something is going wrong in the backup, I think, The replication script might be okay, but the script that's dumping the data out of integration is maybe having problems? (I think I lack the permissions to investigate that further, and I certainly lack the product knowledge at the moment!)
Worth tagging @barrucadu here who worked on the changes for this for the Postgres 13.
Re the .gz, one of the annoyances is that these dumps are named as gz files when they're actually the pgdump custom format. A good deed would be to change that in the file that creates the dump - but it does look a bit fiddly: https://github.com/alphagov/govuk-puppet/blob/c234f285227901fc027abe1fd49f3afed7266077/modules/govuk_env_sync/files/govuk_env_sync.sh#L179-L181 - I'm not sure if there's a standard extension for a pg_dump custom format - .pgdump
?
Ah, that explains that. Though even for a different format they look maybe a little bit messed up - take a look at line 9 of the dump there, which has this:
...217347371publishing_api_productioDATABASEnCREATE DATABASE...
which looks like some output has interleaved with some other output accidentally (notice the "n" cut off the end of the database name).
I've given this a go myself on a couple of smaller postgres databases (including service-manual-publisher as Olly stated) and haven't experienced the problem.
E.g. for Service Manual Publisher:
➜ govuk-docker git:(main) gds aws govuk-integration-poweruser ./bin/replicate-postgresql.sh service-manual-publisher
Touch your YubiKey...
Replicating postgres for service-manual-publisher
download: s3://govuk-integration-database-backups/service-manual-publisher-postgres/2022-02-03T05:00:01-service-manual-publisher_production.gz to replication/postgresql/service_manual_publisher_production.dump.gz
stopping running govuk-docker containers...
docker-compose -f [...] down
Removing govuk-docker_postgres-13_run_ed4b80446767 ... done
Removing govuk-docker_postgres-13_run_b6e0ac2e1353 ... done
Removing govuk-docker_postgres-13_run_6a9bdfadf7bb ... done
Removing govuk-docker_postgres-13_run_730f8681e4b2 ... done
Removing govuk-docker_postgres-13_1 ... done
Removing network govuk-docker_default
docker-compose -f [...] config
docker-compose -f [...] up -d postgres-13
Creating network "govuk-docker_default" with the default driver
Creating govuk-docker_postgres-13_1 ... done
waiting for postgres...
docker-compose -f [...] config
docker-compose -f [...] run postgres-13 /usr/bin/psql -h postgres-13 -U postgres -c DROP DATABASE IF EXISTS "service-manual-publisher"
Creating govuk-docker_postgres-13_run ... done
NOTICE: database "service-manual-publisher" does not exist, skipping
DROP DATABASE
docker-compose -f [...] run postgres-13 /usr/bin/createdb -h postgres-13 -U postgres service-manual-publisher
Creating govuk-docker_postgres-13_run ... done
docker-compose -f [...] run postgres-13 /usr/bin/pg_restore -h postgres-13 -U postgres -d service-manual-publisher --no-owner --no-privileges
Creating govuk-docker_postgres-13_run ... done
11.6MiB 0:00:04 [2.64MiB/s] [================================================================================================================================================================================================================================>] 100%
docker-compose -f [...] stop postgres-13
Stopping govuk-docker_postgres-13_1 ... done
A couple of things that come to mind are:
I imagine the best route forward is to test restoring one of our smaller postgres databases to see if that can work - as it'll be way faster for others to try replicate and less painful for testing. Service Manual Publisher seems quite well suited for that as is only about ~12 MB in it's PGdump format.
I'm trying Publishing API myself but not expecting that to finish before I call it a day today.
Edit: not that it helps you much, but Publishing API replicated fine for me:
@KludgeKML it looks like this is a compatibility issue with Docker Compose v2 – which is relatively new, whereas this project was built to work with v1.
This isn't the first time we've had issues with v2 – for example, see #506 which I hoped was fixed by #554.
I experienced the same issue as you when running with Docker Compose v2 – but switching back to Compose v1 seems to do the trick. Obviously that's not a long term fix, but it should get things working for you for the time being.
If you're using Docker for Mac, you should be able to switch back to v1 in the main Preferences pane:
I've just opened #573 which should resolve this issue.
I'm working on something that relies on publishing-api to provide a content object from a content and attachment id, so I need to replicate the publishing-api database locally. If I run the replicate command:
gds aws govuk-integration-readonly --art 6h ./bin/replicate-postgresql.sh publishing-api
(--art 6h to allow the 20Gb file to download)
...it gets the file, then runs through a huge list of docker containers trying to stop them, spins up the PG 13.1 container but then fails to do the import with the error:
Error response from daemon: OCI runtime start failed: cannot start a container that has stopped: unknown
(More details below, from the start of step 2 after all the containers have been stopped/removed)