NixOS / nixpkgs


Postgres service module blocks WAL restoration #346886

Open kr-nn opened 1 month ago

kr-nn commented 1 month ago

Describe the bug

When restoring from a backup using WAL archive logs, the ExecStartPost script in systemd (used for the ensure clauses) tries to write using ALTER USER while the database is still read-only, causing systemd to kill the service instead of letting the database recover.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Configure a simple postgres instance with a user, and set that user to superuser in services.postgresql.ensureUsers (a configuration sketch follows below)
  2. Configure WAL archive backups (I used wal-g)
  3. Attempt a restore by removing the data directory, restoring the base backup, and replaying the WAL
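
Roughly the kind of configuration involved (a sketch, not my exact setup; the wal-g storage and credential wiring is omitted and the option values are illustrative):

```nix
{ config, pkgs, ... }:
{
  services.postgresql = {
    enable = true;
    ensureDatabases = [ "discourse" ];
    ensureUsers = [
      {
        name = "discourse";
        # matches the ALTER ROLE "discourse" login superuser seen in the log below
        ensureClauses.login = true;
        ensureClauses.superuser = true;
      }
    ];
    settings = {
      archive_mode = true;
      archive_command = "wal-g wal-push %p"; # assumes wal-g is configured elsewhere
    };
  };
}
```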

Expected behavior

The service starts and restores the WAL. Recovery should finish and the instance should be promoted, so that write access is restored before the ensure clauses are executed.

Additional context

ExecStartPost runs right after read-only connections are accepted:

Oct 06 09:57:03 discourse-data postgres[32806]: [32806] LOG:  starting PostgreSQL 15.8 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.3.0, 64-bit
Oct 06 09:57:03 discourse-data postgres[32806]: [32806] LOG:  listening on IPv4 address "0.0.0.0", port 5432
Oct 06 09:57:03 discourse-data postgres[32806]: [32806] LOG:  listening on IPv6 address "::", port 5432
Oct 06 09:57:03 discourse-data postgres[32806]: [32806] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
Oct 06 09:57:03 discourse-data postgres[32810]: [32810] LOG:  database system was interrupted; last known up at 2024-10-06 12:02:46 GMT
Oct 06 09:57:03 discourse-data postgres[32810]: [32810] LOG:  creating missing WAL directory "pg_wal/archive_status"
Oct 06 09:57:03 discourse-data postgres[32811]: ERROR: 2024/10/06 09:57:03.892925 Archive '00000002.history' does not exist.
Oct 06 09:57:03 discourse-data postgres[32810]: [32810] LOG:  starting point-in-time recovery to earliest consistent point
Oct 06 09:57:03 discourse-data postgres[32810]: [32810] LOG:  starting backup recovery with redo LSN 0/40000D8, checkpoint LSN 0/4000110, on timeline ID 1
Oct 06 09:57:03 discourse-data postgres[32810]: [32810] LOG:  restored log file "000000010000000000000004" from archive
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  redo starts at 0/40000D8
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  restored log file "000000010000000000000005" from archive
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  completed backup recovery with redo LSN 0/40000D8 and end LSN 0/40001B0
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  consistent recovery state reached at 0/40001B0
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  recovery stopping after reaching consistency
Oct 06 09:57:04 discourse-data postgres[32810]: [32810] LOG:  redo done at 0/40001B0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.08 s
Oct 06 09:57:04 discourse-data postgres[32806]: [32806] LOG:  database system is ready to accept read-only connections
Oct 06 09:57:04 discourse-data postgres[32871]: [32871] ERROR:  cannot execute ALTER ROLE in a read-only transaction
Oct 06 09:57:04 discourse-data postgres[32871]: [32871] STATEMENT:  ALTER ROLE "discourse" login superuser
Oct 06 09:57:04 discourse-data postgresql-post-start[32870]: ERROR:  cannot execute ALTER ROLE in a read-only transaction
Oct 06 09:57:04 discourse-data systemd[1]: postgresql.service: Control process exited, code=exited, status=1/FAILURE

Notify maintainers

@thoughtpolice

Metadata

kyle@discourse-data  /var/lib  nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.10.7, NixOS, 24.11 (Vicuna), 24.11.20240831.12228ff`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.5`
 - nixpkgs: `/nix/store/bd4fmzws6n5542khxbifbkr6nrygi232-source`

Atemu commented 1 month ago

cc @Ma27 @RaitoBezarius

Ma27 commented 1 month ago

Hmm, these ensure clauses are in ExecStartPost, so they are invoked after postgresql has notified systemd that it's ready, even though it's only available for read-only connections.

First, we should discuss what the expected behavior is. Just waiting in ExecStartPost is perhaps not ideal, since the post-hook must finish before services in the same transaction with an After= dependency on postgresql get started, and I think this also has a timeout?

cc @wolfgangwalther for opinions.

kr-nn commented 1 month ago

Is there any reason we can't put a while-loop guard clause in the generated script to check whether the connection is read-only?

Regardless of the context, we should only ensure things after we know we are in a writable state.
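
A minimal sketch of what such a guard could look like, prepended to the module's generated post-start script from user configuration; it assumes psql is on the unit's PATH, the default port/socket, and that polling pg_is_in_recovery() is an acceptable readiness check:

```nix
{ lib, ... }:
{
  # Hypothetical workaround: block the module's ensure statements until
  # recovery has finished and the server accepts writes.
  systemd.services.postgresql.postStart = lib.mkBefore ''
    while [ "$(psql -XtAc 'SELECT pg_is_in_recovery();' postgres)" != "f" ]; do
      sleep 1
    done
  '';
}
```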

Ma27 commented 1 month ago

Is there any reason we can't put a while-loop guard clause in the generated script to check whether the connection is read-only?

As I mentioned above, this will delay the unit being ready and thus also any other unit with an After=postgresql.service set. Also, I'm not sure how this interacts with timeouts. Hence the question.

wolfgangwalther commented 2 weeks ago

Hm, looking at postStart here: https://github.com/NixOS/nixpkgs/blob/8f0377b2b83c3ff5d1670f2dff5e6388cc4deb84/nixos/modules/services/databases/postgresql.nix#L563-L605 ...

It seems quite clear that the latter part of this (ensureDatabases / ensureUsers) depends on a read-write connection being available. The while loop before only checks for any connection. Imho, this is clearly wrong.

As I mentioned above, this will delay the unit being ready and thus also any other unit with an After=postgresql.service set. Also, I'm not sure how this interacts with timeouts.

I'd assume that in the case of OP, the biggest delay is in doing the restore, which the current implementation waits for anyway. I don't have much experience with either systemd or restoring from backups, though.

Oct 06 09:57:04 discourse-data postgres[32806]: [32806] LOG: database system is ready to accept read-only connections

I don't understand yet why the database ends up accepting read-only connections after this restore, though. Googling for some random PITR logs gives me this: https://www.pythian.com/blog/technical-track/your-complete-guide-point-in-time-restore-pitr-using-pg_basebackup. It shows a log that never contains that "read-only" line but ends in "is ready to accept connections". Maybe it's possible to fix this by changing the restore procedure somehow?

@kr-nn could you create a reproducer for this in the form of a NixOS test? Then it would be much clearer what the actual restore commands etc. are.
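
Something along these lines could serve as a starting point; it's only a skeleton, the cp-based archive_command stands in for wal-g, and the actual backup/restore steps are placeholders:

```nix
import ./make-test-python.nix ({ pkgs, ... }: {
  name = "postgresql-wal-restore";

  nodes.machine = { ... }: {
    services.postgresql = {
      enable = true;
      ensureUsers = [ { name = "discourse"; ensureClauses.superuser = true; } ];
      settings = {
        archive_mode = true;
        # a local cp-based archive keeps the test self-contained
        archive_command = "cp %p /var/backup/wal/%f";
        restore_command = "cp /var/backup/wal/%f %p";
      };
    };
    systemd.tmpfiles.rules = [ "d /var/backup/wal 0750 postgres postgres -" ];
  };

  testScript = ''
    machine.wait_for_unit("postgresql.service")
    # 1. create data, take a base backup, generate some WAL
    # 2. stop postgresql, wipe the data directory, restore the base backup,
    #    create recovery.signal
    # 3. start postgresql again and assert that the unit does not fail while
    #    the WAL is replayed (this is where the ALTER ROLE error currently hits)
  '';
})
```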

kr-nn commented 2 weeks ago

Absolutely, I'll update with a PoC.

That being said, in my experience doing a PITR the databases stay in read-only mode while the WAL is replaying, to prevent issues while they are recovering.

The base backup is usually restored and opened in read-only mode, the WAL replays, and then the databases open in write mode.
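
For context, the restore side here looks roughly like this; the wal-g commands and paths are a sketch from memory, not my exact invocation:

```nix
{
  # The restore itself happens out of band, roughly:
  #   systemctl stop postgresql
  #   rm -rf /var/lib/postgresql/<version>/*
  #   wal-g backup-fetch /var/lib/postgresql/<version> LATEST
  #   touch /var/lib/postgresql/<version>/recovery.signal
  #   systemctl start postgresql   # replays the WAL, then promotes
  services.postgresql.settings = {
    restore_command = "wal-g wal-fetch %f %p"; # fetch archived WAL segments during recovery
    recovery_target = "immediate";             # "earliest consistent point" in the log above
    recovery_target_action = "promote";        # go read-write once recovery finishes
  };
}
```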

Ma27 commented 2 weeks ago

The while loop before only checks for any connection. Imho, this is clearly wrong.

Yeah, that's probably the culprit. That postgresql is up is guaranteed there anyway, because ExecStartPost gets invoked after postgresql has notified systemd that it can accept connections.

wolfgangwalther commented 2 weeks ago

The ready-state while read-only is signaled to systemd here:

https://github.com/postgres/postgres/blob/e18512c000eeb0da3509c37ef4a3ce0530b396bd/src/backend/postmaster/postmaster.c#L3565-L3582

So:

This causes potentially longer delays in two steps: during base backup restore and during WAL replay.

We currently only wait for any connection in postStart, which will always be fast - but it breaks. If we wait for a read-write connection there, then we need to wait for the WAL restore, which could take quite some time. Thus @Ma27's concern about a potential systemd timeout.

If I read the code right, then PostgreSQL will notify systemd with READY=1 after allowing read-only connections, but also after promoting, i.e. after allowing write connections. I wonder whether there would be a way to:

wolfgangwalther commented 2 weeks ago

We could also think about removing those three lines via patch:

https://github.com/postgres/postgres/blob/e18512c000eeb0da3509c37ef4a3ce0530b396bd/src/backend/postmaster/postmaster.c#L3573-L3575

But I don't know the full implications of that regarding systemd and running a true read-replica.

Ma27 commented 2 weeks ago

and if we are still replaying, then issue READY=0 somehow? ...

Hmm, is this even possible? :thinking: Would be interesting to see what implications this has on timeouts and dependencies.

We could also think about removing those three lines via patch:

Can you elaborate? READY=1 can only be sent from ExecStart, not the post-hook, so just removing it without any further action would make systemd time out on startup.

Maybe it's possible to fix this by changing the restore procedure somehow?

Given the implications such a change would have, I'd prefer to first explore if we can find a different solution. I'd be happy to document that in the manual for others.

wolfgangwalther commented 2 weeks ago

and if we are still replaying, then issue READY=0 somehow? ...

Hmm, is this even possible? 🤔 Would be interesting to see what implications this has on timeouts and dependencies.

I don't think so, no ;)

We could also think about removing those three lines via patch:

Can you elaborate? READY=1 can only be sent from ExecStart, not the post-hook, so just removing it without any further action would make systemd time out on startup.

READY=1 would still be sent by Postgres eventually - after WAL is restored / the node is promoted. But yeah, no idea what to do about read replicas, they might never send it.

Maybe it's possible to fix this by changing the restore procedure somehow?

Given the implications such a change would have, I'd prefer to first explore if we can find a different solution. I'd be happy to document that in the manual for others.

Not sure whether the observed behavior is with recovery_target_action = promote already? If not, it might be worth exploring that.

Otherwise it would make sense to explore how recovery_target_action = shutdown works, i.e. when the WAL is replayed (before shutdown or not?). If it's replayed before, then this could work, because according to the postgres code it would not signal READY=1 yet. We'd just need to manually clean up the recovery.signal file.
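
As a sketch (illustrative only), the shutdown route on the settings side would be something like:

```nix
{
  services.postgresql.settings = {
    recovery_target = "immediate";
    recovery_target_action = "shutdown"; # stop at the target instead of promoting
  };
  # afterwards: remove recovery.signal from the data directory by hand and
  # start the unit again for normal read-write operation
}
```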

kr-nn commented 2 weeks ago

The behavior is observed with recovery_target_action = promote.

Right now, as a workaround, I comment out the ensure clauses and run a `nixos-rebuild test` to replay the WAL. It works flawlessly.
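
Expressed declaratively, the workaround is roughly this (hypothetical, untested):

```nix
{ lib, ... }:
{
  # temporarily drop the ensure clauses instead of commenting out module code;
  # revert and rebuild once the node has replayed the WAL and promoted
  services.postgresql.ensureUsers = lib.mkForce [ ];
  services.postgresql.ensureDatabases = lib.mkForce [ ];
}
```

Then `nixos-rebuild test`, wait for promotion, revert, and rebuild again so the ensure clauses are applied to the now-writable cluster.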