Open kr-nn opened 1 month ago
cc @Ma27 @RaitoBezarius
Hmm, these ensure clauses are in ExecStartPost, so they are invoked after postgresql has notified systemd that it's ready, even though it's only available for read-only connections.
First, we should discuss what the expected behavior is. Just waiting in ExecStartPost is perhaps not ideal, since the post-hook must finish before services in the same transaction with an After= dependency on postgresql get started, and I think this also has a timeout?
cc @wolfgangwalther for opinions.
Is there any reason we can't put a while loop guard clause in the generated script to check if the connection is read only?
Regardless of the context, we should only ensure things after we know we are in a writable state.
Is there any reason we can't put a while loop guard clause in the generated script to check if the connection is read only?
As I mentioned above, this will delay the unit being ready and thus also any other unit with an After=postgresql.service set. Also, I'm not sure how this interacts with timeouts. Hence the question.
Hm, looking at postStart here: https://github.com/NixOS/nixpkgs/blob/8f0377b2b83c3ff5d1670f2dff5e6388cc4deb84/nixos/modules/services/databases/postgresql.nix#L563-L605 ...
It seems quite clear that the latter part of this (ensureDatabases / ensureUsers) depends on a read-write connection being available. The while loop before only checks for any connection. Imho, this is clearly wrong.
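A minimal sketch of the guard discussed above, assuming psql can reach the server as the invoking user. The function names and the PSQL variable are hypothetical and not taken from the module; pg_is_in_recovery() is the standard PostgreSQL function that returns true while the server is still replaying WAL.

```shell
# Hypothetical helpers for illustration; PSQL may be set to a psql
# invocation with extra flags and defaults to plain "psql".

# Succeeds only when the server reports that it is no longer in recovery,
# i.e. it accepts read-write connections.
is_read_write() {
  # pg_is_in_recovery() prints "f" once recovery has finished.
  [ "$(${PSQL:-psql} -d postgres -tAc 'SELECT pg_is_in_recovery()' 2>/dev/null)" = "f" ]
}

# Polls until the server is writable. $1 is the polling interval in
# seconds (default 1). This loop has no upper bound on its own.
wait_for_read_write() {
  while ! is_read_write; do
    sleep "${1:-1}"
  done
}
```

Calling wait_for_read_write before the ensure steps would hold them back until promotion. Note that a connection failure is treated the same as "still in recovery", so the loop also covers the case where the server is not accepting connections yet.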
As I mentioned above, this will delay the unit being ready and thus also any other unit with an After=postgresql.service set. Also, I'm not sure how this interacts with timeouts.
I'd assume that in the case of OP, the biggest delay is in doing the restore, which is waited for with the current implementation anyway. I don't have much experience with either systemd or restoring from backup, though.
Oct 06 09:57:04 discourse-data postgres[32806]: [32806] LOG: database system is ready to accept read-only connections
I don't understand, yet, why the database ends up accepting read-only connections after this restore, though. Googling for some random logs for PITR gives me this: https://www.pythian.com/blog/technical-track/your-complete-guide-point-in-time-restore-pitr-using-pg_basebackup. There is a log which never shows that "read-only" line but ends in "is ready to accept connections". Maybe it's possible to fix this by changing the restore procedure somehow?
@kr-nn could you create a reproducer for this in the form of a NixOS test? Then it would be much clearer what the actual restore commands etc. are.
Absolutely, I'll update with a PoC.
That being said, in my experience with PITR, the databases stay in read-only mode while WAL is replaying, to prevent issues while the databases are recovering.
The base backup is usually restored and opened in read-only mode, the WAL replays, and then the databases open in write mode.
The while loop before only checks for any connection. Imho, this is clearly wrong.
Yeah, that's probably the culprit. That postgresql is up is guaranteed there anyway, because ExecStartPost gets invoked after postgresql has notified systemd about being able to accept connections.
The ready-state while read-only is signaled to systemd here:
So:
This causes potentially longer delays in two steps: During base backup restore and during WAL replay.
We currently only wait for any connections in postStart, which will always be fast, but it breaks. If we wait for a read-write connection there, then we need to wait for WAL restore, which could take quite some time. Thus @Ma27's concern about a potential systemd timeout.
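To make the timeout concern concrete, here is a sketch of a bounded wait. It assumes the start timeout of the unit also covers ExecStartPost, which this thread has not confirmed. The helper name wait_with_deadline is hypothetical; the point is only that a wait for read-write should fail explicitly after a chosen deadline rather than run until systemd kills the service.

```shell
# wait_with_deadline TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND once per second. Returns 0 as soon
# as COMMAND succeeds, or 1 if TIMEOUT_SECONDS have elapsed and COMMAND
# is still failing.
wait_with_deadline() {
  timeout_s=$1
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Example use (not executed here): give WAL replay up to 10 minutes.
#   wait_with_deadline 600 sh -c \
#     '[ "$(psql -d postgres -tAc "SELECT pg_is_in_recovery()")" = f ]'
```

The deadline would have to stay below whatever start timeout systemd applies to the unit, so for long WAL replays both values would need to be raised together.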
If I read the code right, then PostgreSQL will notify systemd with READY=1 after allowing read-only connections, but also after promoting, so when allowing write connections. I wonder whether there would be a way to:
[...] postStart (or rather whether we are still in restore)
and if we are still replaying, then issue READY=0 somehow? ...
[...] postStart again?
We could also think about removing those three lines via patch:
But I don't know the full implications of that regarding systemd and running a true read-replica.
and if we are still replaying, then issue READY=0 somehow? ...
Hmm, is this even possible? 🤔 Would be interesting to see what implications this has on timeouts and dependencies.
We could also think about removing those three lines via patch:
Can you elaborate?
READY=1 can only be sent from ExecStart, not the post-hook, so just removing it without any further action would just time out systemd on startup.
Maybe it's possible to fix this by changing the restore procedure somehow?
Given the implications such a change would have, I'd prefer to first explore if we can find a different solution. I'd be happy to document that in the manual for others.
and if we are still replaying, then issue READY=0 somehow? ...
Hmm, is this even possible? 🤔 Would be interesting to see what implications this has on timeouts and dependencies.
I don't think so, no ;)
We could also think about removing those three lines via patch:
Can you elaborate? READY=1 can only be sent from ExecStart, not the post-hook, so just removing it without any further action would just time out systemd on startup.
READY=1 would still be sent by Postgres eventually - after WAL is restored / the node is promoted. But yeah, no idea what to do about read replicas, they might never send it.
Maybe it's possible to fix this by changing the restore procedure somehow?
Given the implications such a change would have, I'd prefer to first explore if we can find a different solution. I'd be happy to document that in the manual for others.
Not sure whether the observed behavior is with recovery_target_action = promote already? If not, it might be worth exploring that.
Otherwise it would make sense to explore how recovery_target_action = shutdown works, i.e. when WAL is replayed (before shutdown or not?). If it's replayed before, then this could work, because it would not signal READY=1 yet, according to the postgres code. We'd just need to manually clean up the recovery.signal file.
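The manual cleanup mentioned here could look roughly like the sketch below. The function name is hypothetical, and whether removing recovery.signal is sufficient after a recovery_target_action = shutdown run is exactly the open question above, so treat this as untested.

```shell
# Hypothetical helper: remove a leftover recovery.signal from the data
# directory given as $1. Returns 1 if the directory does not exist.
clear_recovery_signal() {
  datadir=$1
  [ -d "$datadir" ] || return 1
  if [ -e "$datadir/recovery.signal" ]; then
    rm -f -- "$datadir/recovery.signal"
    echo "removed $datadir/recovery.signal"
  else
    echo "no recovery.signal in $datadir"
  fi
}
```

This would have to run while the server is stopped, i.e. after the shutdown triggered by the recovery target and before the next start.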
The behavior is observed with recovery_target_action=promote.
Right now, as a workaround, I comment out the ensure clauses and run a nixos-rebuild test to replay the WAL. It works flawlessly.
Describe the bug
When restoring from a backup using WAL archive logs, the ExecStartPost in systemd (used for EnsureClauses) tries to write using ALTER USER while the databases are still read-only, causing systemd to kill the service instead of letting the database recover.
Steps To Reproduce
Steps to reproduce the behavior:
Expected behavior
The service starts and restores the WAL. Before EnsureClauses are run, the recovery should finish and promote, so that write access is restored prior to EnsureClauses being executed.
Additional context
ExecStartPost runs right after read-only connections are accepted.
Notify maintainers
@thoughtpolice
Metadata