2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 17, 16, 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.
http://2ndquadrant.com/en/resources/pglogical/

replication in catchup state, wal sender high cpu usage #37

Closed fr-butch closed 3 years ago

fr-butch commented 8 years ago

Replication stopped working after 2 hours:

postgres=# select * from pg_stat_replication;
  pid  | usesysid | usename  | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
  7036 |       10 | postgres | walreceiver      | 192.168.0.55 |                 |       55638 | 2016-09-07 13:54:47.633061+03 |              | streaming | 162/2D57F1B8  | 162/2D57F1B8   | 162/2D57F1B8   | 162/2D57D490    |             0 | async
 14934 |       10 | postgres | audit2           | 192.168.0.57 |                 |       54697 | 2016-09-08 09:44:11.601138+03 |     82827775 | catchup   | 160/3C3DFE90  | 160/3C3DFE90   | 160/3C3DFE90   | 160/3C3DFE90    |             0 | async
(2 rows)

postgres=# select * from pg_replication_slots;
              slot_name               |      plugin      | slot_type |  datoid  |    database     | active | active_pid |   xmin   | catalog_xmin | restart_lsn  
--------------------------------------+------------------+-----------+----------+-----------------+--------+------------+----------+--------------+--------------
 pgl_lsd_master_orig_provider1_audit2 | pglogical_output | logical   | 60735254 | lsd_master_orig | t      |      14934 |          |     82379410 | 160/3949FA30
 pgl_lsd_master_orig_provider1_audit1 | pglogical_output | logical   | 60735254 | lsd_master_orig | f      |            |          |     82287138 | 15F/D5675338
 db2_slot                             |                  | physical  |          |                 | t      |       7036 | 82858274 |              | 162/2D586190
(3 rows)

and the wal sender process is at 100% CPU usage:

postgres  7012  0.0  0.0 6577536 2896 ?        Ss   Sep07   0:34 postgres: wal writer process   
postgres  7036  0.1  0.0 6579500 3548 ?        Ss   Sep07   1:24 postgres: wal sender process postgres 192.168.0.55(55638) streaming 162/2D7C17E8
postgres 14934 99.5  0.1 6600412 24988 ?       Rs   09:44  33:54 postgres: wal sender process postgres 192.168.0.57(54697) idle

In the logs on the provider:

2016-09-08 09:44:11 MSK [14934]: [3-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  received replication command: START_REPLICATION SLOT "pgl_lsd_master_orig_provider1_audit2" LOGICAL 160/393F5CF0 (expected_encoding 'UTF8', min_proto_version '1', max_proto_version '1', startup_params_format '1', "binary.want_internal_basetypes" '1', "binary.want_binary_basetypes" '1', "binary.basetypes_major_version" '905', "binary.sizeof_datum" '8', "binary.sizeof_int" '4', "binary.sizeof_long" '8', "binary.bigendian" '0', "binary.float4_byval" '1', "binary.float8_byval" '1', "binary.integer_datetimes" '1', "hooks.setup_function" 'pglogical.pglogical_hooks_setup', "pglogical.forward_origins" '"all"', "pglogical.replication_set_names" 'audit_set', "relmeta_cache_size" '-1', pg_version '90503', pglogical_version '1.1.2', pglogical_version_num '10102')
2016-09-08 09:44:11 MSK [14934]: [5-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter startup_params_format ignored
2016-09-08 09:44:11 MSK [14934]: [6-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [7-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter pglogical.forward_origins ignored
2016-09-08 09:44:11 MSK [14934]: [8-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [9-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter pglogical.replication_set_names ignored
2016-09-08 09:44:11 MSK [14934]: [10-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [11-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter relmeta_cache_size ignored
2016-09-08 09:44:11 MSK [14934]: [12-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [13-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter pglogical_version ignored
2016-09-08 09:44:11 MSK [14934]: [14-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [15-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 DEBUG:  Unrecognised pglogical parameter pglogical_version_num ignored
2016-09-08 09:44:11 MSK [14934]: [16-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 CONTEXT:  slot "pgl_lsd_master_orig_provider1_audit2", output plugin "pglogical_output", in the startup callback
2016-09-08 09:44:11 MSK [14934]: [17-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 LOG:  starting logical decoding for slot "pgl_lsd_master_orig_provider1_audit2"
2016-09-08 09:44:11 MSK [14934]: [19-1] db=lsd_master_orig,appname=audit2,user=postgres,client=192.168.0.57 LOG:  logical decoding found consistent point at 160/3949FA30

On the subscriber:

2016-09-08 09:44:06 MSK [6937]: [5-1] db=,appname=,user=,client= ERROR:  connection to other side has died
2016-09-08 09:44:06 MSK [938]: [135-1] db=,appname=,user=,client= LOG:  worker process: pglogical apply 186620:1800593394 (PID 6937) exited with exit code 1
2016-09-08 09:44:06 MSK [938]: [136-1] db=,appname=,user=,client= DEBUG:  unregistering background worker "pglogical apply 186620:1800593394"
2016-09-08 09:44:11 MSK [938]: [137-1] db=,appname=,user=,client= DEBUG:  registering background worker "pglogical apply 186620:1800593394"
2016-09-08 09:44:11 MSK [938]: [138-1] db=,appname=,user=,client= DEBUG:  starting background worker process "pglogical apply 186620:1800593394"
2016-09-08 09:44:11 MSK [25448]: [1-1] db=,appname=,user=,client= LOG:  starting apply for subscription audit2
2016-09-08 09:44:11 MSK [25448]: [2-1] db=,appname=,user=,client= DEBUG:  connecting to provider provider1, dsn host=192.168.0.51 port=5432 dbname=lsd_master_orig
2016-09-08 09:44:12 MSK [25448]: [3-1] db=,appname=,user=,client= DEBUG:  upstream Pg version is 9.5.2
2016-09-08 09:44:12 MSK [25448]: [4-1] db=,appname=,user=,client= DEBUG:  changeset origin forwarding enabled: t

"upstream Pg version is 9.5.2" is strange -- our PostgreSQL server was 9.5.0 and was later updated to 9.5.4:

postgres=# select version();
                                                 version                                                 
---------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4), 64-bit
(1 row)

I have one replication set with a single table on the provider:

SELECT pglogical.create_node(
    node_name := 'provider1',
    dsn := 'host=192.168.0.51 port=5432 dbname=lsd_master_orig'
);
select pglogical.create_replication_set('audit_set', TRUE, TRUE, TRUE, FALSE);
select pglogical.replication_set_add_table('audit_set', 'audit.public$mc_case_y2016_m9', true);

On the subscriber:

SELECT pglogical.create_node(
    node_name := 'subscriber1',
    dsn := 'host=192.168.0.57 port=5432 dbname=lsd_master_orig'
);

SELECT pglogical.create_subscription(
    subscription_name := 'audit2',
    provider_dsn := 'host=192.168.0.51 port=5432 dbname=lsd_master_orig',
    replication_sets := '{audit_set}',
        synchronize_structure := false,
        synchronize_data := true
);

lsd_master_orig=# select pglogical.show_subscription_status();
                                                          show_subscription_status                                                          
--------------------------------------------------------------------------------------------------------------------------------------------
 (audit2,replicating,provider1,"host=192.168.0.51 port=5432 dbname=lsd_master_orig",pgl_lsd_master_orig_provider1_audit2,{audit_set},{all})
(1 row)

I had the same problem with a previous subscription (audit1). So I dropped the database on the subscriber, dropped the pglogical extension on the provider, updated the PostgreSQL packages, and restarted. Then I recreated the database on the subscriber and started over, with no luck.

perf top for the wal sender on the provider:

Samples: 2M of event 'cpu-clock', Event count (approx.): 13123950452                                                                                                                                                                                                          
Overhead  Shared Object        Symbol                                                                                                                                                                                                                                         
  87.27%  postgres             [.] hash_seq_search
   5.37%  postgres             [.] CatalogCacheIdInvalidate
   2.07%  postgres             [.] hash_search_with_hash_value
   1.04%  postgres             [.] LocalExecuteInvalidationMessage
   1.00%  postgres             [.] CallSyscacheCallbacks
   0.61%  postgres             [.] RelfilenodeMapInvalidateCallback
   0.58%  postgres             [.] hash_uint32
   0.52%  postgres             [.] hash_search
   0.32%  postgres             [.] ReorderBufferCommit
   0.25%  postgres             [.] hash_seq_init
   0.21%  postgres             [.] deregister_seq_scan
   0.14%  postgres             [.] RelationCacheInvalidateEntry
   0.14%  postgres             [.] uint32_hash
   0.13%  pglogical_output.so  [.] relmeta_cache_callback
   0.08%  postgres             [.] PlanCacheRelCallback
   0.05%  postgres             [.] InvalidateCatalogSnapshot
   0.05%  pglogical_output.so  [.] hash_search@plt
   0.05%  [kernel]             [k] __do_softirq
   0.04%  postgres             [.] GetCurrentTransactionNestLevel
   0.02%  [kernel]             [k] vmxnet3_poll_rx_only
   0.01%  [kernel]             [k] _raw_spin_unlock_irqrestore
   0.01%  [kernel]             [k] vmxnet3_xmit_frame
   0.00%  [kernel]             [k] tcp_wfree
   0.00%  [kernel]             [k] vmxnet3_tq_tx_complete
   0.00%  [kernel]             [k] __slab_free
   0.00%  [kernel]             [k] kfree
   0.00%  [kernel]             [k] tcp_urg
   0.00%  [kernel]             [k] skb_release_data
   0.00%  [kernel]             [k] bio_advance
   0.00%  [kernel]             [k] tcp_v4_do_rcv
   0.00%  postgres             [.] binaryheap_first
   0.00%  [kernel]             [k] unmap_single
   0.00%  [kernel]             [k] skb_release_head_state
   0.00%  [kernel]             [k] raw_local_deliver
   0.00%  [kernel]             [k] blk_update_bidi_request
   0.00%  [kernel]             [k] tcp_send_delayed_ack
   0.00%  [kernel]             [k] sock_wfree
   0.00%  [kernel]             [k] vmxnet3_rq_rx_complete
   0.00%  [kernel]             [k] read_tsc
   0.00%  [kernel]             [k] check_leaf.isra.7
   0.00%  [kernel]             [k] tg_load_down
   0.00%  [kernel]             [k] __inet_lookup_established
   0.00%  [kernel]             [k] tcp_valid_rtt_meas
   0.00%  postgres             [.] NamespaceCallback
   0.00%  libc-2.17.so         [.] __memcmp_sse4_1
   0.00%  [kernel]             [k] queue_work_on
   0.00%  [kernel]             [k] vmxnet3_unmap_tx_buf
   0.00%  [kernel]             [k] net_rx_action
   0.00%  [kernel]             [k] swiotlb_unmap_page
   0.00%  postgres             [.] ResourceOwnerEnlargeRelationRefs
   0.00%  [kernel]             [k] __slab_alloc
   0.00%  [kernel]             [k] sock_def_readable
   0.00%  postgres             [.] PlanCacheFuncCallback
   0.00%  [kernel]             [k] retint_careful
   0.00%  [kernel]             [k] napi_complete
   0.00%  [kernel]             [k] inet_csk_destroy_sock
   0.00%  [kernel]             [k] consume_skb

strace for wal sender:

read(14, "H\0\0\0\0\0\0\0`\245{;`\1\0\0\4\0\0\0\0\0\0\0\236t\0\0\0\0\0\0"..., 72) = 72
read(14, "", 0)                         = 0
read(14, "H\0\0\0\0\0\0\0\240\245{;`\1\0\0\4\0\0\0\0\0\0\0\236t\0\0\0\0\0\0"..., 72) = 72
read(14, "", 0)                         = 0
read(14, "H\0\0\0\0\0\0\0\340\245{;`\1\0\0\1\0\0\0\0\0\0\0\177\6\0\0\26\277\236\3"..., 72) = 72
read(14, "", 0)                         = 0
read(14, "H\0\0\0\0\0\0\0@\250{;`\1\0\0\4\0\0\0\0\0\0\0\237t\0\0\0\0\0\0"..., 72) = 72
read(14, "", 0)                         = 0
read(14, "H\0\0\0\0\0\0\0\200\250{;`\1\0\0\4\0\0\0\0\0\0\0\237t\0\0\0\0\0\0"..., 72) = 72
read(14, "", 0)                         = 0

postgresql.conf:

max_worker_processes = 10
max_wal_senders = 10
wal_level = logical
wal_receiver_status_interval=10
shared_preload_libraries = 'pg_stat_statements,pgstattuple,pglogical'
wal_buffers = 64
wal_keep_segments = 1000
max_replication_slots = 10
track_commit_timestamp = on

# rpm -qa | grep postgres
postgresql95-pglogical-output-1.1.0-1_2ndQuadrant.el7.centos.x86_64
postgresql95-libs-9.5.4-1PGDG.rhel7.x86_64
postgresql95-pglogical-debuginfo-1.1.2-1_2ndQuadrant.el7.centos.x86_64
postgresql95-pglogical-1.1.2-1_2ndQuadrant.el7.centos.x86_64
postgresql95-9.5.4-1PGDG.rhel7.x86_64
postgresql95-pglogical-output-debuginfo-1.1.0-1_2ndQuadrant.el7.centos.x86_64
postgresql95-contrib-9.5.4-1PGDG.rhel7.x86_64
postgresql95-debuginfo-9.5.4-1PGDG.rhel7.x86_64
postgresql95-server-9.5.4-1PGDG.rhel7.x86_64
ringerc commented 8 years ago

Thanks for the detailed report. Sorry it took so long to respond; we're extremely busy with development and customer support.

This looks likely to be an issue with the relation metadata cache invalidation code. Have you been able to reliably reproduce this problem? Does it happen consistently?

Are you able to attach gdb to one of the high-CPU backends and get a few backtraces?

Attach gdb, then run:

set pagination off
set logging on
bt full
info locals
cont
[control C]
bt full
info locals
cont
[control C]
bt full
info locals
cont

... repeat a few times, then quit gdb and send the resulting gdb.txt. A pastebin site, a gist, or whatever is fine, or attach it directly here.

Continue execution each time, then interrupt again and do another 'bt full'.
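
If an interactive session is awkward, a batch-mode loop like the following captures a few full backtraces into gdb.txt (a sketch only; PID 14934 is the busy walsender from the ps output above, and the debuginfo packages listed earlier are needed for usable symbols):

# Take three backtrace snapshots of the busy walsender, 5 seconds apart.
for i in 1 2 3; do
    gdb -p 14934 -batch \
        -ex 'set pagination off' \
        -ex 'bt full' \
        -ex 'info locals' >> gdb.txt 2>&1
    sleep 5
done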

azagirov commented 8 years ago

gdb.txt is on pastebin: http://pastebin.com/ME3Leysn

azagirov commented 8 years ago

My replication sets contain ~4500 tables. This happened after I removed ~1500 replicated tables in the provider database. Then I removed these tables in the subscriber database, but this did not help. Replication does not start working.

postgres=# select * from pg_stat_replication;
  pid  | usesysid | usename  | application_name | client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 28731 |       10 | postgres | walreceiver      | 192.168.0.55 |                 |       48850 | 2016-10-14 15:24:36.33258+03  |              | streaming | 199/3FDC0270  | 199/3FDC0270   | 199/3FDC0270   | 199/3FDC0160    |             0 | async
  1647 |       10 | postgres | audit2           | 192.168.0.57 |                 |       45068 | 2016-10-17 12:38:25.805543+03 |     94914676 | catchup   | 199/3D4F5868  | 199/3D47FFD8   | 199/3D47FFD8   | 199/3D47FFD8    |             0 | async
(2 rows)

postgres=# select * from pg_replication_slots;
              slot_name               |      plugin      | slot_type |  datoid  |    database     | active | active_pid |   xmin   | catalog_xmin | restart_lsn  
--------------------------------------+------------------+-----------+----------+-----------------+--------+------------+----------+--------------+--------------
 pgl_lsd_master_orig_provider1_audit2 | pglogical_output | logical   | 60735254 | lsd_master_orig | t      |      14980 |          |     94912304 | 199/397F56F8
 db2_slot                             |                  | physical  |          |                 | t      |      28731 | 94915845 |              | 199/3FE72260
(2 rows)
ringerc commented 8 years ago

On 17 October 2016 at 17:48, azagirov notifications@github.com wrote:

This happened after I removed some replicated tables in the provider database. Then I removed these tables in the subscriber database, but this did not help. Replication does not start working.

Interesting. From the debug info provided, it seems to be busy processing cache invalidations inside logical decoding on the upstream's walsender, rather than in pglogical's output plugin or downstream.

That's consistent with the 'ps' output etc you sent.

I can't tell from this whether it's in some very expensive/slow operation where we have something scaling O(big) or whether it's in an infinite loop and, if so, at what level it's looping. You'd need to do some more interactive debugging for that. A start would be to use gdb, with logging enabled, to run each function until finish (using the gdb command "finish"), walking up the call stack, and see which function never ends. Then interrupt with control-C and single-step-over ("next") to see what part of the function it's looping in, and how. Print variables with "info args", "info locals", etc.
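
Interactively, that workflow looks roughly like the following gdb command sequence (a sketch; which frame turns out to be the looping one will of course differ):

# Walk up the call stack: repeat "finish" until one call never returns;
# that caller is the function that is looping.
finish
finish
# When "finish" hangs, interrupt with Ctrl-C, then inspect and single-step:
bt
info args
info locals
next
next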

Since you seem to be pretty comfortable with the available tools, hopefully this is something you're able to do.

If not, try to see if you can reproduce this in a cut down test case to isolate what exactly is causing the problematic behaviour. If you can work out a set of steps and inputs (SQL scripts, commands, etc) to reproduce this and attach it I should be able to take a look.

I currently have to prioritize some development and customer support work so I can't spend a long time trying to work out what could be going on and/or make a wholly synthetic test case, but I'm happy to help guide you to collect info and investigate in the hopes we can track the issue down.


ringerc commented 8 years ago

It's immediately noteworthy that you're using a LOT more tables than we typically deal with. So if we have something that's O(n^2) for n tables or anything like that, it'd be consistent with what you're seeing.

It would be interesting to try to create a slot using the test_decoding output plugin and replay from it. See if you have the same performance issues there. If so, that tells us pretty conclusively that it's a performance issue in logical decoding in core PostgreSQL.

You can use pg_recvlogical to create a test_decoding slot and replay from it. You'll need the test_decoding contrib installed.
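
For example, something along these lines (a sketch; the host and database come from the provider DSN earlier in the thread, and the slot name is made up):

# Create a test_decoding slot on the provider and stream from it to /dev/null;
# if this also drives a walsender to 100% CPU, the problem is in core logical
# decoding rather than in the pglogical output plugin.
pg_recvlogical -h 192.168.0.51 -p 5432 -d lsd_master_orig -U postgres \
    -S test_slot -P test_decoding --create-slot
pg_recvlogical -h 192.168.0.51 -p 5432 -d lsd_master_orig -U postgres \
    -S test_slot --start -f /dev/null

# Drop the slot afterwards so it does not hold back WAL and catalog rows.
pg_recvlogical -h 192.168.0.51 -p 5432 -d lsd_master_orig -U postgres \
    -S test_slot --drop-slot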

PJMODOS commented 7 years ago

There were some fixes in PostgreSQL recently that might help with this. I would recommend installing the new minor version that is coming out next week.

mfenniak commented 7 years ago

I believe I've run across a very similar (if not the same) issue, which is being discussed and diagnosed on pgsql-general. This issue report helped me narrow down and reproduce test cases for the problem affecting me, so thanks for that.

(this probably isn't really a 2ndQuadrant/pglogical issue; it seems to be a core engine issue)

PJMODOS commented 7 years ago

Yes, it's a logical decoding issue, so it affects everything that uses logical decoding, but we fix issues in core as well :)

derkan commented 5 years ago

Same here. I cannot shut down the database; the service shutdown command waits forever. While it is waiting for shutdown, the wal sender processes consume 100% CPU. Using pglogical 2.2.1 and PostgreSQL 11.1.

Update: I think this was because some tables had sync problems; some tables' sync_status wasn't r. So I reconfigured replication, and now PostgreSQL can shut down cleanly.
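
For anyone who hits the same thing, the per-table sync state can be checked on the subscriber with something like this (a sketch, assuming pglogical 2.x's pglogical.local_sync_status catalog and the database name used earlier in this thread):

# List tables whose sync_status is not 'r' (replicating) on the subscriber.
psql -d lsd_master_orig -c "
    SELECT sync_nspname, sync_relname, sync_status
    FROM pglogical.local_sync_status
    WHERE sync_status <> 'r';"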

ringerc commented 5 years ago

@derkan If you can reproduce the issue and get the chance to run perf record --call-graph dwarf -u postgres and perf report -g, that might be helpful.
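
Roughly the following, run while the problem is occurring (a sketch; it needs root or appropriate perf privileges, and a bounded duration is added here via sleep):

# Sample all processes owned by the postgres OS user with DWARF call graphs
# for 30 seconds, then view the aggregated call tree.
perf record --call-graph dwarf -u postgres -- sleep 30
perf report -g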

derkan commented 5 years ago

Sorry, I can't reproduce it now, but I've opened #200 to get a warning when creating a subscription to a remote db if the db_name in the dsn and the current database name differ. This kind of error makes the remote db use 100% CPU on shutdown. (Although the two databases' schemas are the same.)

ringerc commented 5 years ago

The db name should be able to differ; if it can't there is definitely a bug at work. The regression tests rely on the db name differing to work though.

derkan commented 5 years ago

The db name should be able to differ; if it can't there is definitely a bug at work. The regression tests rely on the db name differing to work though.

Thanks @ringerc. I'll get a perf log if I face the same problem again.

eulerto commented 3 years ago

Since this is a logical decoding issue that was fixed in 9.4.13, 9.5.8, and 9.6.4, we are closing this issue.

Reduce overhead in processing syscache invalidation events

Discussion: https://www.postgresql.org/message-id/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com