2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
1.02k stars 154 forks source link

resolve additional deadlock during drop database :: pglogical_apply.c :: CHECK_FOR_INTERRUPTS() #427

Open Andrei-Pozolotin opened 1 year ago

Andrei-Pozolotin commented 1 year ago

this was tested on top of: PostgreSQL-15.3 + pglogical-2.4.3

resolve additional deadlock during drop database: fix #418 fix #399

solution for the deadlock in pglogical_apply_main() is identical to the solution for the deadlock in pglogical_supervisor_main(): https://github.com/2ndQuadrant/pglogical/pull/403

Andrei-Pozolotin commented 1 year ago

note: there are still few more "deadlock candidates" in the code, for example, search for:

while (!got_SIGTERM)

and review (loops are not using CHECK_FOR_INTERRUPTS()):

pglogical_sync.c

pglogical_sync_worker_cleanup(PGLogicalSubscription *sub)
wait_for_sync_status_change(Oid subid, const char *nspname, const char *relname,
Andrei-Pozolotin commented 1 year ago

@eulerto please review

nmisch commented 1 year ago

I've reviewed this. Note that the linked issues say this began with PostgreSQL 15 commit postgres/postgres@4eb2176318d, but it was Windows-only until postgres/postgres@e2f65f4255. I do recommend merging this PR as-is, but it would be even better with these improvements:

  1. Changing pglogical_sync_worker_cleanup and wait_for_sync_status_change the same way, even though they're not expected to loop indefinitely like apply_work

  2. Normal order of operations is: WaitLatch ResetLatch check WL_POSTMASTER_DEATH CHECK_FOR_INTERRUPTS() normal loop work

    See https://github.com/postgres/postgres/blob/b1a8dc846da4d96d903dcb5733f68a1e02d82a23/src/include/storage/latch.h#L41 for some background and https://github.com/postgres/postgres/blob/d392e9bdea957964e1fa6a5481e5adb5904d759a/src/test/modules/worker_spi/worker_spi.c#L238 for an example. This PR is instead putting CHECK_FOR_INTERRUPTS() just before WaitLatch. This is largely harmless, and order does vary within pglogical code. Still, I recommend moving the call to just after the WL_POSTMASTER_DEATH check.

  3. Add test coverage, perhaps just:

    --- a/sql/parallel.sql
    +++ b/sql/parallel.sql
    @@ -34,6 +34,10 @@ SELECT sync_kind, sync_subid, sync_nspname, sync_relname, sync_status IN ('y', '
    
    SELECT * FROM pglogical.show_subscription_status();
    
    +-- apply worker shall not block DROP of a different database
    +CREATE DATABASE to_be_dropped;
    +DROP DATABASE to_be_dropped;
    +
    -- Make sure we see the slot and active connection
    \c :provider_dsn
    SELECT plugin, slot_type, database, active FROM pg_replication_slots;
  4. Change leading whitespace to match surrounding lines (spaces vs. tabs).

nmisch commented 9 months ago

This achieves a lot for a one-line change. @eulerto (since no other person has merged a pglogical commit in the last eight months), would you merge this? If not, what would you like to see first?

nmisch commented 8 months ago

@petere I see you are the other of the two people who have committed to pglogical in the last 2y. This pull request achieves a lot for a one-line change. Would you merge this? If not, what would you like to see first?

nmisch commented 5 months ago

@eulerto @petere @mwanner This is just adding a CHECK_FOR_INTERRUPTS() so DROP DATABASE does not hang. For 11 months, the pull request has been waiting attention from someone with repository write access. Would you grant me write access so I can take responsibility for merging this? If not, how should we get this process-hang bug resolved?

eulerto commented 5 months ago

@nmisch I will check all pending PRs (including this one) in the next few days.

nmisch commented 2 months ago

@eulerto This is still just adding one CHECK_FOR_INTERRUPTS(), and it's been waiting 14 months for someone with repository write access. Adding a CHECK_FOR_INTERRUPTS() is one of the safest changes doable in PostgreSQL code. If merging this creates the need for a followup fix, you can count on me to write that followup fix. What is one obstacle to merging it over the next month?