Closed: fr-butch closed this issue 3 years ago.
Thanks for the detailed report. Sorry it took so long to respond; we're extremely busy with development and customer support.
This looks likely to be an issue with the relation metadata cache invalidation code. Have you been able to reliably reproduce this problem? Does it happen consistently?
Are you able to attach 'gdb' to one of the high-CPU backends and get a few backtraces?
attach gdb, then
set pagination off
set logging on
bt full
info locals
cont
[control C]
bt full
info locals
cont
[control C]
bt full
info locals
cont
... repeat a few times, then quit gdb and send the gdb.txt that results. A pastebin site, a gist, or whatever is fine, or attach it directly here.
Continue execution each time, then interrupt again and do another 'bt full'.
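A minimal sketch of that whole session, assuming the busy process is a walsender; <pid> is a placeholder you'd take from ps or pg_stat_activity:
# find the busiest walsender PID (the process title contains "wal sender")
ps -eo pid,pcpu,cmd | grep '[w]al sender' | sort -k2 -rn | head
# attach gdb; "set logging on" writes everything to gdb.txt in the current directory
gdb -p <pid>
set pagination off
set logging on
bt full
info locals
cont
# press Ctrl-C, then repeat bt full / info locals / cont a few times before quitting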
gdb.txt in pastebin http://pastebin.com/ME3Leysn
My replication sets contain ~4500 tables. This happened after I removed ~1500 replicated tables from the provider database. I then removed these tables from the subscriber database as well, but that did not help; replication does not start working.
postgres=# select * from pg_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
28731 | 10 | postgres | walreceiver | 192.168.0.55 | | 48850 | 2016-10-14 15:24:36.33258+03 | | streaming | 199/3FDC0270 | 199/3FDC0270 | 199/3FDC0270 | 199/3FDC0160 | 0 | async
1647 | 10 | postgres | audit2 | 192.168.0.57 | | 45068 | 2016-10-17 12:38:25.805543+03 | 94914676 | catchup | 199/3D4F5868 | 199/3D47FFD8 | 199/3D47FFD8 | 199/3D47FFD8 | 0 | async
(2 rows)
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | active | active_pid | xmin | catalog_xmin | restart_lsn
--------------------------------------+------------------+-----------+----------+-----------------+--------+------------+----------+--------------+--------------
pgl_lsd_master_orig_provider1_audit2 | pglogical_output | logical | 60735254 | lsd_master_orig | t | 14980 | | 94912304 | 199/397F56F8
db2_slot | | physical | | | t | 28731 | 94915845 | | 199/3FE72260
(2 rows)
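As an aside, a purely illustrative way to quantify how far that 'catchup' walsender is behind, using the 9.5-era function names (pg_current_xlog_location, pg_xlog_location_diff):
-- run on the provider; shows bytes of WAL still to be sent per walsender
select application_name, state,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location) as send_lag_bytes
from pg_stat_replication;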
Interesting. From the debug info provided, it seems to be busy processing cache invalidations inside logical decoding on the upstream's walsender, rather than in pglogical's output plugin or downstream.
That's consistent with the 'ps' output etc you sent.
I can't tell from this whether it's in some very expensive/slow operation where we have something scaling O(big) or whether it's in an infinite loop and, if so, at what level it's looping. You'd need to do some more interactive debugging for that. A start would be to use gdb, with logging enabled, to run each function until finish (using the gdb command "finish"), walking up the call stack, and see which function never ends. Then interrupt with control-C and single-step-over ("next") to see what part of the function it's looping in, and how. Print variables with "info args", "info locals", etc.
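A sketch of that narrowing process, assuming gdb is already attached with logging on as described above:
finish
# repeat "finish" to walk up the call stack; the frame whose finish never returns is where it loops
# once found, interrupt with Ctrl-C and step through that function:
next
info args
info locals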
Since you seem to be pretty good with the tools available hopefully this is something you're able to do.
If not, try to see if you can reproduce this in a cut down test case to isolate what exactly is causing the problematic behaviour. If you can work out a set of steps and inputs (SQL scripts, commands, etc) to reproduce this and attach it I should be able to take a look.
I currently have to prioritize some development and customer support work so I can't spend a long time trying to work out what could be going on and/or make a wholly synthetic test case, but I'm happy to help guide you to collect info and investigate in the hopes we can track the issue down.
It's immediately noteworthy that you're using a LOT more tables than we typically deal with. So if we have something that's O(n^2) for n tables, or anything like that, it'd be consistent with what you're seeing.
It would be interesting to try to create a slot using the test_decoding output plugin and replay from it. See if you have the same performance issues there. If so, that tells us pretty conclusively that it's a performance issue in logical decoding in core PostgreSQL.
You can use pg_recvlogical to create a test_decoding slot and replay from it. You'll need the test_decoding contrib installed.
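A sketch of that check, with a placeholder slot name and the database name taken from the pg_replication_slots output above; adjust to your setup:
# create a throwaway test_decoding slot on the provider
pg_recvlogical -d lsd_master_orig --slot test_decoding_check --create-slot -P test_decoding
# replay from it to stdout and watch whether the walsender shows the same 100% CPU behaviour
pg_recvlogical -d lsd_master_orig --slot test_decoding_check --start -f -
# drop the slot afterwards so it does not hold back WAL
pg_recvlogical -d lsd_master_orig --slot test_decoding_check --drop-slot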
There were some fixes in PostgreSQL recently that might help with this. I would recommend installing the new minor version that is coming next week.
I believe I've run across a very similar (if not the same) issue, which is being discussed & diagnosed on pgsql-general. This issue report helped me narrow down and reproduce test cases for the problem affecting me, so thanks for that.
(this probably isn't really a 2ndQuadrant/pglogical issue; it seems to be a core engine issue)
Yes, it's a logical decoding issue, so it affects everything using logical decoding, but we fix issues in core as well :)
Same here. I cannot shut down the DB; the service shutdown command waits forever. While it is waiting for shutdown, walsender processes consume 100% CPU. Using pglogical 2.2.1 and PostgreSQL 11.1.
Update: I think this was because some tables had sync problems. Some tables' sync_status wasn't 'r'. So I reconfigured replication and now PostgreSQL can shut down cleanly.
@derkan If you can reproduce the issue and get the chance to do a perf record --call-graph dwarf -u postgres and a perf report -g, that might be helpful.
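A sketch of that capture, assuming perf and PostgreSQL debug symbols are installed; the 30-second window is arbitrary:
# record call graphs for all processes owned by the postgres user for ~30 seconds
sudo perf record --call-graph dwarf -u postgres -- sleep 30
# browse the resulting profile (reads ./perf.data)
sudo perf report -g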
Sorry, I can't reproduce it now, but I've added #200 to warn, when creating a subscription to a remote DB, if the db_name in the DSN and the current database name differ. This kind of error makes the remote DB use 100% CPU on shutdown. (Although the two DBs' schemas are the same.)
The db name should be able to differ; if it can't there is definitely a bug at work. The regression tests rely on the db name differing to work though.
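For illustration only (all names and the DSN are placeholders, not taken from this report), a subscription whose local database name differs from the provider's database name is expected to work:
-- run on the subscriber database, which may have any name
select pglogical.create_subscription(
    subscription_name := 'sub_example',
    provider_dsn := 'host=provider.example port=5432 dbname=provider_db user=replicator'
);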
Thanks @ringerc. I'll get a perf log if I face the same problem again.
Replication stopped working after 2 hours:
and I got 100% CPU usage.
In the logs on the provider:
On the subscriber:
Upstream Pg version is 9.5.2 -- this is strange; our PostgreSQL server was 9.5.0 and was then updated to 9.5.4.
I have one replication set with 1 table on the provider.
Subscriber:
I had the same problem with the previous subscription (audit1). So I dropped the database on the subscriber and dropped the pglogical extension on the provider, updated the PostgreSQL packages, and restarted. Then I created the database on the subscriber and started over, with no luck.
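A rough, placeholder-only sketch of that reset; run each statement from a session connected to an appropriate database:
-- subscriber: remove the broken subscription before starting over
select pglogical.drop_subscription('audit1');
-- provider: dropping the extension also removes its nodes and replication sets
drop extension pglogical cascade;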
perf top for the wal sender on the provider:
strace for the wal sender:
postgresql.conf: