2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/
Other
1.01k stars 153 forks source link

Segmentation fault after upgrade to Pg 13 #343

Open its-johnt opened 3 years ago

its-johnt commented 3 years ago

Hi! We're using Pglogical to sync from Pg 12.8 to Pg 13.4, using pglogical_create_subscriber. We're on Ubuntu 20.04, using the debs from PGDG. I also tried using the pglogical 2.4.0 debs from dl.2ndquadrant.com, which is what the trace below is from. The upgrade goes smoothly, then shortly after Pg is upgraded from 12 to 13 with pg_upgrade, while we're running ANALYZE on the subscriber, we get:

background worker "pglogical apply 18174:1918136780" (PID 9965) was terminated by signal 11: Segmentation fault

Below is a backtrace. Please let me know if there's any other info that would be helpful to troubleshoot.

Thank you for your help.

(gdb) backtrace
#0 0x00005627015f892b in GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:845
#1 0x000056270135a2c0 in postquel_start (fcache=0x5627030ed418, es=0x5627030e8960)
at ./build/../src/backend/executor/functions.c:830
#2 fmgr_sql (fcinfo=0x562703087ac0) at ./build/../src/backend/executor/functions.c:1161
#3 0x000056270134755b in ExecInterpExpr (state=0x5627030865d8, econtext=0x562703086668, isnull=)
at ./build/../src/backend/executor/execExprInterp.c:704
#4 0x000056270133f1ea in ExecEvalExprSwitchContext (isNull=0x7ffeff428d57, econtext=0x562703086668, state=)
at ./build/../src/include/executor/executor.h:322
#5 ExecCheck (state=, econtext=econtext@entry=0x562703086668) at ./build/../src/backend/executor/execExpr.c:642
#6 0x00005627014ea110 in domain_check_input (value=94725554599368, isnull=false, my_extra=0x562703086318)
at ./build/../src/backend/utils/adt/domains.c:181
#7 0x00005627014ea22e in domain_in (fcinfo=0x7ffeff428e10) at ./build/../src/backend/utils/adt/domains.c:256
#8 0x00005627015ca766 in InputFunctionCall (flinfo=0x7ffeff428e90, str=0x5627030d4268 "75.110.82.213", typioparam=,
typmod=) at ./build/../src/backend/utils/fmgr/fmgr.c:1550
#9 0x00005627015ca902 in OidInputFunctionCall (functionId=, str=0x5627030d4268 "75.110.82.213",
typioparam=2014332433, typmod=-1) at ./build/../src/backend/utils/fmgr/fmgr.c:1653
#10 0x00007f4201fb7320 in pglogical_read_tuple (in=0x7ffeff4312a0, rel=0x5627030c7760, tuple=0x7ffeff42d110)
at ./pglogical_proto_native.c:814
#11 0x00007f4201fb876d in pglogical_read_update (in=in@entry=0x7ffeff4312a0, lockmode=lockmode@entry=3,
hasoldtup=hasoldtup@entry=0x7ffeff42900f, oldtup=oldtup@entry=0x7ffeff429010, newtup=newtup@entry=0x7ffeff42d110)
at ./pglogical_proto_native.c:674
#12 0x00007f4201f9cefa in handle_update (s=s@entry=0x7ffeff4312a0) at ./pglogical_apply.c:657
#13 0x00007f4201f9e74b in replication_handler (s=0x7ffeff4312a0) at ./pglogical_apply.c:1135
#14 apply_work (streamConn=) at ./pglogical_apply.c:1421
#15 0x00007f4201f9f19d in pglogical_apply_main (main_arg=) at ./pglogical_apply.c:1978
#16 0x0000562701429728 in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:890
#17 0x0000562701436db3 in do_start_bgworker (rw=) at ./build/../src/backend/postmaster/postmaster.c:5870
#18 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:6095
#19 0x0000562701437815 in sigusr1_handler (postgres_signal_arg=)
at ./build/../src/backend/postmaster/postmaster.c:5255
#20
#21 0x00007f4204a940da in select () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x0000562701437ec6 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1703
#23 0x00005627014399cf in PostmasterMain (argc=5, argv=) at ./build/../src/backend/postmaster/postmaster.c:1412
#24 0x00005627011b968a in main (argc=5, argv=0x562702fd5f10) at ./build/../src/backend/main/main.c:210
eulerto commented 3 years ago

It seems similar to issue #336 but I didn't observe a crash. Could you try building from this repository to figure out it is the same issue that was fixed by commit bd2e972dca9a9fdb7a5ae88f41780fc24ade8632 ?

its-johnt commented 3 years ago

Hi, thanks for taking a look. I created a new package that included bd2e972 and tried again, and it still crashed. I got pretty much the same backtrace:


For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/postgresql/13/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/14/e9b5fd51f5703e6029ac43cf0e50f6801789ba.debug...
[New LWP 30984]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: pglogical apply 18174:1918136780                                    '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055b0a8b0b92b in GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:845
845 ./build/../src/backend/utils/time/snapmgr.c: No such file or directory.
(gdb) backtrace
#0  0x000055b0a8b0b92b in GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:845
#1  0x000055b0a886d2c0 in postquel_start (fcache=0x55b0a99d7838, es=0x55b0a9a0cdb0)
    at ./build/../src/backend/executor/functions.c:830
#2  fmgr_sql (fcinfo=0x55b0a98e9728) at ./build/../src/backend/executor/functions.c:1161
#3  0x000055b0a885a55b in ExecInterpExpr (state=0x55b0a98e8868, econtext=0x55b0a98e88f8, isnull=<optimized out>)
    at ./build/../src/backend/executor/execExprInterp.c:704
#4  0x000055b0a88521ea in ExecEvalExprSwitchContext (isNull=0x7ffe214fd057, econtext=0x55b0a98e88f8, state=<optimized out>)
    at ./build/../src/include/executor/executor.h:322
#5  ExecCheck (state=<optimized out>, econtext=econtext@entry=0x55b0a98e88f8) at ./build/../src/backend/executor/execExpr.c:642
#6  0x000055b0a89fd110 in domain_check_input (value=94217247298664, isnull=false, my_extra=0x55b0a98e8608)
    at ./build/../src/backend/utils/adt/domains.c:181
#7  0x000055b0a89fd22e in domain_in (fcinfo=0x7ffe214fd110) at ./build/../src/backend/utils/adt/domains.c:256
#8  0x000055b0a8add766 in InputFunctionCall (flinfo=0x7ffe214fd190, str=0x55b0a98f1d08 "98.28.217.128", 
    typioparam=<optimized out>, typmod=<optimized out>) at ./build/../src/backend/utils/fmgr/fmgr.c:1550
#9  0x000055b0a8add902 in OidInputFunctionCall (functionId=<optimized out>, str=0x55b0a98f1d08 "98.28.217.128", 
    typioparam=2014332433, typmod=-1) at ./build/../src/backend/utils/fmgr/fmgr.c:1653
#10 0x00007f2d16d88300 in pglogical_read_tuple (in=0x7ffe215055a0, rel=0x55b0a99e05d8, tuple=0x7ffe21501410)
    at ./pglogical_proto_native.c:814
#11 0x00007f2d16d8974d in pglogical_read_update (in=in@entry=0x7ffe215055a0, lockmode=lockmode@entry=3, 
    hasoldtup=hasoldtup@entry=0x7ffe214fd30f, oldtup=oldtup@entry=0x7ffe214fd310, newtup=newtup@entry=0x7ffe21501410)
    at ./pglogical_proto_native.c:674
#12 0x00007f2d16d6defa in handle_update (s=s@entry=0x7ffe215055a0) at ./pglogical_apply.c:657
#13 0x00007f2d16d6f74b in replication_handler (s=0x7ffe215055a0) at ./pglogical_apply.c:1135
#14 apply_work (streamConn=<optimized out>) at ./pglogical_apply.c:1421
#15 0x00007f2d16d7019d in pglogical_apply_main (main_arg=<optimized out>) at ./pglogical_apply.c:1978
#16 0x000055b0a893c728 in StartBackgroundWorker () at ./build/../src/backend/postmaster/bgworker.c:890
#17 0x000055b0a8949db3 in do_start_bgworker (rw=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:5870
#18 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:6095
#19 0x000055b0a894a815 in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at ./build/../src/backend/postmaster/postmaster.c:5255
#20 <signal handler called>
#21 0x00007f2d198650da in select () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x000055b0a894aec6 in ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1703
#23 0x000055b0a894c9cf in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1412
#24 0x000055b0a86cc68a in main (argc=5, argv=0x55b0a9836f10) at ./build/../src/backend/main/main.c:210
(gdb)
eulerto commented 3 years ago

Could you try to obtain the relation? You can use gdb.

f 10
p rel->nspname
p rel->relname

After that try to create a self-contained test case using this relation.

The other option is to obtain the commands from the replication slot on provider. Execute the following command:

SELECT pg_logical_slot_peek_changes('SLOTNAMEGOESHERE',
                                      NULL, NULL,
                                      'min_proto_version', '1',
                                      'max_proto_version', '1',
                                      'startup_params_format', '1',
                                      'proto_format', 'json');
simonat2ndQuadrant commented 3 years ago

On Fri, 24 Sept 2021 at 17:26, John Thompson @.***> wrote:

Program terminated with signal SIGSEGV, Segmentation fault.

0 0x000055b0a8b0b92b in GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:845

845 ./build/../src/backend/utils/time/snapmgr.c: No such file or directory. (gdb) backtrace

0 0x000055b0a8b0b92b in GetActiveSnapshot () at ./build/../src/backend/utils/time/snapmgr.c:845

1 0x000055b0a886d2c0 in postquel_start (fcache=0x55b0a99d7838, es=0x55b0a9a0cdb0)

at ./build/../src/backend/executor/functions.c:830

2 fmgr_sql (fcinfo=0x55b0a98e9728) at ./build/../src/backend/executor/functions.c:1161

3 0x000055b0a885a55b in ExecInterpExpr (state=0x55b0a98e8868, econtext=0x55b0a98e88f8, isnull=)

at ./build/../src/backend/executor/execExprInterp.c:704

4 0x000055b0a88521ea in ExecEvalExprSwitchContext (isNull=0x7ffe214fd057, econtext=0x55b0a98e88f8, state=)

at ./build/../src/include/executor/executor.h:322

5 ExecCheck (state=, @.***=0x55b0a98e88f8) at ./build/../src/backend/executor/execExpr.c:642

6 0x000055b0a89fd110 in domain_check_input (value=94217247298664, isnull=false, my_extra=0x55b0a98e8608)

at ./build/../src/backend/utils/adt/domains.c:181

7 0x000055b0a89fd22e in domain_in (fcinfo=0x7ffe214fd110) at ./build/../src/backend/utils/adt/domains.c:256

8 0x000055b0a8add766 in InputFunctionCall (flinfo=0x7ffe214fd190, str=0x55b0a98f1d08 "98.28.217.128",

typioparam=<optimized out>, typmod=<optimized out>) at ./build/../src/backend/utils/fmgr/fmgr.c:1550

9 0x000055b0a8add902 in OidInputFunctionCall (functionId=, str=0x55b0a98f1d08 "98.28.217.128",

typioparam=2014332433, typmod=-1) at ./build/../src/backend/utils/fmgr/fmgr.c:1653

10 0x00007f2d16d88300 in pglogical_read_tuple (in=0x7ffe215055a0, rel=0x55b0a99e05d8, tuple=0x7ffe21501410)

at ./pglogical_proto_native.c:814

11 0x00007f2d16d8974d in pglogical_read_update @.=0x7ffe215055a0, @.=3,

***@***.***=0x7ffe214fd30f, ***@***.***=0x7ffe214fd310, ***@***.***=0x7ffe21501410)
at ./pglogical_proto_native.c:674

12 0x00007f2d16d6defa in handle_update @.***=0x7ffe215055a0) at ./pglogical_apply.c:657

pglogical is barfing while trying to execute an SQL function that is part of a CHECK expression on a domain for a column that is being UPDATEd.

So it should be easy to come up with a repeatable test case from the above. ANALYZE doesn't appear to be involved.

My guess is that we don't set the active snapshot early enough and this is causing the SQL function execution to fail. I notice ensure_transaction() doesn't create a snapshot.

-- Simon Riggs http://www.EnterpriseDB.com/

its-johnt commented 3 years ago

Hi, thanks for the info. Indeed, ANALYZE doesn't seem to be related. Neither does upgrading to Pg 13. I repro a segfault with just Pg 12.8 and pglogical 2.4.0 + the commit you mentioned above with this schema:

CREATE FUNCTION public.inet_ipv6_transitional_networks() RETURNS SETOF cidr
    LANGUAGE sql IMMUTABLE STRICT LEAKPROOF PARALLEL SAFE
    AS $$
  VALUES ('2001::/32'::cidr), ('2002::/16'), ('64:ff9b::/96'), ('::ffff:0:0:0/96'), ('::ffff:0:0/96')
$$;

CREATE FUNCTION public.inet_is_v6_transitional(inet) RETURNS boolean
    LANGUAGE sql IMMUTABLE STRICT LEAKPROOF PARALLEL SAFE
    AS $_$
  SELECT EXISTS (SELECT FROM inet_ipv6_transitional_networks() s(net) WHERE net >>= $1)
$_$;

CREATE DOMAIN public.inet_normalized AS inet
CONSTRAINT inet_normalized_check CHECK ((NOT public.inet_is_v6_transitional(VALUE)));

CREATE TABLE public.test_ip_address (
    ip_address public.inet_normalized NOT NULL,
    CONSTRAINT test_ip_address_ipv4_is_address CHECK (((masklen((ip_address)::inet) = 32) OR (family((ip_address)::inet) = 6)))
);

ALTER TABLE ONLY public.test_ip_address
    ADD CONSTRAINT test_ip_address_normalized_pkey PRIMARY KEY (ip_address);

INSERT INTO public.test_ip_address VALUES
('152.224.0.1'),
('152.224.0.2');

After setting up logical replication with pglogical_create_subscriber I ran DELETE FROM test_ip_address WHERE ip_address = '152.224.0.1' on the source and the dest segfaulted. The backtrace looks different this time, but I'm testing on a different system, so I'm not sure if it's that or something different about crash:

(gdb) backtrace
#0  0x000055e5a9964604 in postquel_start (fcache=0x55e5aba63ab0, es=0x55e5aba6d488)
    at utils/time/./build/../src/backend/utils/time/snapmgr.c:845
#1  fmgr_sql (fcinfo=0x55e5aba5fc18) at executor/./build/../src/backend/executor/functions.c:1144
#2  0x000055e5a994d9e7 in ExecInterpExpr (state=0x55e5ab9f66b0, econtext=0x55e5ab9f6740, isnull=<optimized out>)
    at executor/./build/../src/backend/executor/execExprInterp.c:654
#3  0x000055e5a9b31229 in ExecEvalExprSwitchContext (isNull=0x7ffcc3d2c507, econtext=0x55e5ab9f6740, state=<optimized out>)
    at executor/./build/../src/include/executor/executor.h:316
#4  ExecCheck (econtext=0x55e5ab9f6740, state=<optimized out>) at executor/./build/../src/backend/executor/execExpr.c:639
#5  domain_check_input (value=94444915654856, isnull=false, my_extra=0x55e5ab9f63f0)
    at utils/adt/./build/../src/backend/utils/adt/domains.c:181
#6  0x000055e5a9b314e1 in domain_recv (fcinfo=0x7ffcc3d2c580) at utils/adt/./build/../src/backend/utils/adt/domains.c:312
#7  0x000055e5a9c3b906 in ReceiveFunctionCall (flinfo=flinfo@entry=0x7ffcc3d2c600, buf=buf@entry=0x7ffcc3d2c6b0, 
    typioparam=typioparam@entry=16389, typmod=typmod@entry=-1) at utils/fmgr/./build/../src/backend/utils/fmgr/fmgr.c:1607
#8  0x000055e5a9c41545 in OidReceiveFunctionCall (functionId=<optimized out>, buf=buf@entry=0x7ffcc3d2c6b0, typioparam=16389, 
    typmod=-1) at utils/fmgr/./build/../src/backend/utils/fmgr/fmgr.c:1670
#9  0x00007f15ad7168bd in pglogical_read_tuple (in=0x7ffcc3d308e0, rel=0x55e5aba41ba0, tuple=0x7ffcc3d2c750)
    at ./pglogical_proto_native.c:792
#10 0x00007f15ad7193e0 in pglogical_read_delete (in=0x7ffcc3d308e0, lockmode=3, oldtup=0x7ffcc3d2c750)
    at ./pglogical_proto_native.c:708
#11 0x00007f15ad703ec7 in handle_delete (s=s@entry=0x7ffcc3d308e0) at ./pglogical_apply.c:686
#12 0x00007f15ad704a0e in replication_handler (s=0x7ffcc3d308e0) at ./pglogical_apply.c:1139
#13 apply_work (streamConn=<optimized out>) at ./pglogical_apply.c:1421
#14 0x00007f15ad705413 in pglogical_apply_main (main_arg=<optimized out>) at ./pglogical_apply.c:1978
#15 0x000055e5a9a5444c in StartBackgroundWorker () at postmaster/./build/../src/backend/postmaster/bgworker.c:911
#16 0x000055e5a9a69618 in do_start_bgworker (rw=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:5850
#17 maybe_start_bgworkers () at postmaster/./build/../src/backend/postmaster/postmaster.c:6075
#18 0x000055e5a9a6984d in sigusr1_handler (postgres_signal_arg=<optimized out>)
    at postmaster/./build/../src/backend/postmaster/postmaster.c:5212
#19 <signal handler called>
#20 0x00007f15b092dac7 in __GI___select (nfds=nfds@entry=5, readfds=readfds@entry=0x7ffcc3d313f0, writefds=writefds@entry=0x0, 
    exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffcc3d31330) at ../sysdeps/unix/sysv/linux/select.c:49
#21 0x000055e5a9a5f02e in ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1689
#22 0x000055e5a9a60d60 in PostmasterMain (argc=3, argv=<optimized out>)
    at postmaster/./build/../src/backend/postmaster/postmaster.c:1398
#23 0x000055e5a97735f7 in main (argc=3, argv=0x55e5ab959250) at main/./build/../src/backend/main/main.c:228

Thanks for your help. Please let me know if there's any other info that would be helpful in solving the issue.

eulerto commented 2 years ago

@its-johnt could you try the PR #351 ?

its-johnt commented 2 years ago

Hi,

I retried my repro from above and with your fix the crashes are gone and replication is working. Thanks!