Open Andrei-Pozolotin opened 1 year ago
note: there are still few more "deadlock candidates" in the code, for example, search for:
while (!got_SIGTERM)
and review (loops are not using CHECK_FOR_INTERRUPTS()
):
pglogical_sync.c
pglogical_sync_worker_cleanup(PGLogicalSubscription *sub)
wait_for_sync_status_change(Oid subid, const char *nspname, const char *relname,
@eulerto please review
I've reviewed this. Note that the linked issues say this began with PostgreSQL 15 commit postgres/postgres@4eb2176318d, but it was Windows-only until postgres/postgres@e2f65f4255. I do recommend merging this PR as-is, but it would be even better with these improvements:
Changing pglogical_sync_worker_cleanup
and wait_for_sync_status_change
the same way, even though they're not expected to loop indefinitely like
apply_work
Normal order of operations is: WaitLatch ResetLatch check WL_POSTMASTER_DEATH CHECK_FOR_INTERRUPTS() normal loop work
See https://github.com/postgres/postgres/blob/b1a8dc846da4d96d903dcb5733f68a1e02d82a23/src/include/storage/latch.h#L41 for some background and https://github.com/postgres/postgres/blob/d392e9bdea957964e1fa6a5481e5adb5904d759a/src/test/modules/worker_spi/worker_spi.c#L238 for an example. This PR is instead putting CHECK_FOR_INTERRUPTS() just before WaitLatch. This is largely harmless, and order does vary within pglogical code. Still, I recommend moving the call to just after the WL_POSTMASTER_DEATH check.
Add test coverage, perhaps just:
--- a/sql/parallel.sql
+++ b/sql/parallel.sql
@@ -34,6 +34,10 @@ SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', '
SELECT * FROM pglogical.show_subscription_status();
+-- apply worker shall not block DROP of a different database
+CREATE DATABASE to_be_dropped;
+DROP DATABASE to_be_dropped;
+
-- Make sure we see the slot and active connection
\c :provider_dsn
SELECT plugin, slot_type, database, active FROM pg_replication_slots;
Change leading whitespace to match surrounding lines (spaces vs. tabs).
This achieves a lot for a one-line change. @eulerto (since no other person has merged a pglogical commit in the last eight months), would you merge this? If not, what would you like to see first?
@petere I see you are the other of the two people who have committed to pglogical in the last 2y. This pull request achieves a lot for a one-line change. Would you merge this? If not, what would you like to see first?
@eulerto @petere @mwanner This is just adding a CHECK_FOR_INTERRUPTS() so DROP DATABASE does not hang. For 11 months, the pull request has been waiting attention from someone with repository write access. Would you grant me write access so I can take responsibility for merging this? If not, how should we get this process-hang bug resolved?
@nmisch I will check all pending PRs (including this one) in the next few days.
@eulerto This is still just adding one CHECK_FOR_INTERRUPTS(), and it's been waiting 14 months for someone with repository write access. Adding a CHECK_FOR_INTERRUPTS() is one of the safest changes doable in PostgreSQL code. If merging this creates the need for a followup fix, you can count on me to write that followup fix. What is one obstacle to merging it over the next month?
this was tested on top of:
PostgreSQL-15.3
+pglogical-2.4.3
resolve additional deadlock during
drop database
: fix #418 fix #399solution for the deadlock in
pglogical_apply_main()
is identical to the solution for the deadlock inpglogical_supervisor_main()
: https://github.com/2ndQuadrant/pglogical/pull/403