Closed ovv closed 6 years ago
I'm getting the same. Critical bug indeed since all replication is crashing.
This appears to be an issue if the node is running PostgreSQL 10.5.
I'm running a 10.5 node and 10.3 subscriber and seeing this issue. I'm not seeing this issue with a 10.3 node and a 10.5 subscriber.
Unfortunately I'm not seeing anything else in the logs that looks like it might be helpful.
Yep, I have the same problem on PostgreSQL 10.5. Both nodes run the same version, and I get:
ERROR: no data left in message
The OS is CentOS 7:
PostgreSQL 10.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
Likewise - source PostgreSQL 9.5 on Ubuntu, targets PostgreSQL 10.5 (previously working on 10.4) on Ubuntu.
We ended up moving to Postgres native replication and away from pglogical. I highly recommend it if you don't use the advanced features of pglogical and are in a hurry to get this working. It is super simple to set up and works great on 10.5 as well. My guess is that it will keep working better with upgrades too, since it is a native module and part of the regular test/release cycle.
The only thing we had to adjust was getting rid of some truncate statements in nightly jobs and change them to delete statements instead, since truncate is not supported in native replication until Postgres 11.
If you are unsure whether your current pglogical replication will work with native replication, you can read more about the limitations and differences here:
https://blog.2ndquadrant.com/pglogical-logical-replication-postgresql-10/
Same problem here with PostgreSQL 9.6 and pglogical 2.2.0
@tvarsis - unfortunately switching to Postgres native replication is not an option for us as we have a primary server on 9.5. Maybe when we have that one migrated to 10 :)
Actually - does anyone know if it would be feasible to set up pglogical without conflict resolution to avoid this issue?
FYI, we've just identified this as a likely ABI break in PostgreSQL. Make sure your pglogical is compiled against 10.5 if you run against 10.5; this issue is likely to only affect pglogical compiled against 10.4 and running against 10.5.
You can work around this by recompiling pglogical against 10.5.
Awesome, thanks. Is there any plan to update the version available in the 2ndQuadrant Debian repository?
Yes, more to come. We're preparing a hackers post and some updates.
I'm at least also seeing this same error message after restarting Debian PG 9.6 servers (the pglogical source is from the 2ndQuadrant repository for Debian stretch).
Package version 2.2.0-1.stretch+1, maybe recent security updates have broken something?
It's an issue with the latest point release. You must ensure your logical decoding output plugins (pglogical, bdr, etc) are built with the same PostgreSQL point release as the running PostgreSQL. If you're running a plugin built on 10.4 on 10.5, it'll crash. Similarly, if you run a plugin built on 10.5 on 10.4, that'll crash too. This affects all the point releases not just 10.x.
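For anyone who wants to check for this mismatch before restarting replication, here is a minimal Python sketch. It assumes you collect both version strings yourself (for example the server's from `SELECT version()` and the build version from your package metadata). The function names are hypothetical and not part of pglogical or PostgreSQL.

```python
import re

def parse_pg_version(text):
    """Extract (major, minor) from a string such as '10.5' or '9.6.10'.

    Before PostgreSQL 10 the major version has two parts (9.6); from 10 on
    it has one (10). So '9.6.10' -> ('9.6', 10) and '10.5' -> ('10', 5).
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    first, second, third = match.groups()
    if int(first) >= 10:
        return first, int(second)
    if third is None:
        raise ValueError(f"incomplete pre-10 version in {text!r}")
    return f"{first}.{second}", int(third)

def same_point_release(server_version, built_against):
    """True when the plugin was built against the server's exact point release."""
    return parse_pg_version(server_version) == parse_pg_version(built_against)

# Example: a plugin built on 10.4 running on a 10.5 server is flagged.
print(same_point_release("PostgreSQL 10.5 on x86_64-pc-linux-gnu", "10.4"))
```

The strict equality check mirrors the advice in this comment; it is a sanity check only and does not inspect the compiled library itself.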
I'll link the hackers post with details soon.
Cc @mjevans
Is there any additional word on this? I upgraded postgres to 9.6.10 from 9.6.9 and it has broken my logical replication using version 2.2.0. I tried to make pglogical from the source against the newer version of postgres which ended up reverting me to pglogical version 2.0.0 somehow, but got it installed and I'm still failing in the same way. This is a critical production system here, so any advice would be appreciated.
2018-08-27 14:45 GMT-03:00 greigwise notifications@github.com:
Is there any additional word on this? I upgraded postgres to 9.6.10 from 9.6.9 and it has broken my logical replication using version 2.2.0. I tried to make pglogical from the source against the newer version of postgres which ended up reverting me to pglogical version 2.0.0 somehow, but got it installed and I'm still failing in the same way. This is a critical production system here, so any advice would be appreciated.
There should be new packages for pglogical available. Can you configure the repository and try downloading apt/yum?
https://dl.2ndquadrant.com/default/release/site/
--
Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
If pglogical is really critical in this environment, I would recommend getting in touch with info@2ndquadrant.com to see how you can get help.
--
Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
@ringerc: I'm very interested in the details of this ABI change, because this sort of issue is the reason Debian is usually very conservative about updating packages to new versions in the stable release. PostgreSQL has a blanket exception there, and new upstream versions are simply waved through because the PG project has a history of being careful about not breaking things. Do you have any pointers?
Oh btw, recompiling pglogical 2.2.0 does not fix the breakage (as per the regression tests) on Debian.
Well, we just got the latest version (2.2.0-3) installed and we're seeing the same error. Is it possible postgres needs restarted?
2018-08-27 15:47 GMT-03:00 Christoph Berg notifications@github.com:
Oh btw, recompiling pglogical 2.2.0 does not fix the breakage (as per the regression tests) on Debian.
Could you share those regression tests output?
Sorry, I don't have a debian server handy.
--
Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
2018-08-27 16:02 GMT-03:00 greigwise notifications@github.com:
Well, we just got the latest version (2.2.0-3) installed and we're seeing the same error. Is it possible postgres needs restarted?
Of course!
Else how would the shared_library get reloaded ;)
Let us know how it went there.
--
Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Could you share those regression tests output?
There are two tests. The first is pg_regress, which passes but takes horribly long, at least for 9.5:
14:10:01 /bin/mkdir -p regression_output
14:10:01 PATH="./tmp_install/usr/lib/postgresql/9.5/bin:$PATH" LD_LIBRARY_PATH="./tmp_install/usr/lib/x86_64-linux-gnu" /usr/lib/postgresql/9.5/lib/pgxs/src/makefiles/../../src/test/regress/pg_regress --inputdir=./ --temp-instance=./tmp_check --bindir= \
14:10:01 --temp-config ./regress-postgresql.conf \
14:10:01 --temp-instance=./tmp_check \
14:10:01 --outputdir=./regression_output \
14:10:01 --create-role=logical \
14:10:01 preseed infofuncs init_fail init preseed_check basic extended conflict_secondary_unique toasted replication_set add_table matview bidirectional primary_key interfaces foreign_key functions copy triggers parallel row_filter row_filter_sampling att_list column_filter apply_delay multiple_upstreams node_origin_cascade drop
14:10:01 ============== creating temporary instance ==============
14:10:01 ============== initializing database system ==============
14:10:05 ============== starting postmaster ==============
14:10:06 running on port 57746 with PID 6417
14:10:06 ============== creating database "regression" ==============
14:10:06 CREATE DATABASE
14:10:06 ALTER DATABASE
14:10:06 ============== creating role "logical" ==============
14:10:06 CREATE ROLE
14:10:06 GRANT
14:10:06 ============== running regression test queries ==============
14:10:06 test preseed ... ok
14:10:06 test infofuncs ... ok
14:10:07 test init_fail ... ok
14:10:08 test init ... ok
14:10:08 test preseed_check ... ok
14:10:55 test basic ... ok
14:26:16 test extended ... ok <-- 15 minutes for 'extended'
14:26:43 test conflict_secondary_unique ... ok
14:26:46 test toasted ... ok
14:26:47 test replication_set ... ok
14:27:07 test add_table ... ok
14:27:11 test matview ... ok
14:27:24 test bidirectional ... ok
14:28:52 test primary_key ... ok
14:28:54 test interfaces ... ok
14:28:56 test foreign_key ... ok
14:29:45 test functions ... ok
14:29:47 test copy ... ok
14:29:53 test triggers ... ok
14:29:56 test parallel ... ok
14:30:36 test row_filter ... ok
14:30:38 test row_filter_sampling ... ok
14:30:57 test att_list ... ok
14:31:10 test column_filter ... ok
14:31:17 test apply_delay ... ok
14:31:21 test multiple_upstreams ... ok
14:31:24 test node_origin_cascade ... ok
14:31:25 test drop ... ok
The prove test then fails without any useful output:
15:09:14 t/010_pglogical_create_subscriber.pl ..
15:09:14 1..11
15:09:14 ok 1 - pglogical_create_subscriber --help exit code 0
15:09:14 ok 2 - pglogical_create_subscriber --help goes to stdout
15:09:14 ok 3 - pglogical_create_subscriber --help nothing to stderr
15:09:14 ok 4 - pglogical_create_subscriber with invalid option nonzero exit code
15:09:14 ok 5 - pglogical_create_subscriber with invalid option prints error message
15:09:14 ok 6 - pglogical_create_subscriber check
15:09:14 ok 7 - preseed check 1
15:09:14 ok 8 - preseed check 2
15:09:14 ok 9 - replication check 1
15:09:14 ok 10 - replication check 2
15:09:14 ok 11 - replication check 3
15:09:14 ok
15:09:37 t/020_non_default_replication_set.pl ..
15:09:37 1..1
15:09:37 ok 1 - replication check
15:09:37 ok
15:09:37 All tests successful.
15:09:37 Files=2, Tests=12, 67 wallclock secs ( 0.04 usr 0.02 sys + 7.74 cusr 3.86 csys = 11.66 CPU)
15:09:37 Result: PASS
15:09:45 Bailout called. Further testing stopped: system psql failed
15:09:45 FAILED--Further testing stopped: system psql failed
@mnencia was also looking at the output, btw.
Christoph
hi @ChristophBerg did you look in the tmp_check/log directory? The useful output from prove should be there.
So we're looking at https://dl.2ndquadrant.com/default/release/browse/apt/pool/main/p/pglogical/ per https://dl.2ndquadrant.com ; there are -3 builds there like postgresql-10-pglogical_2.2.0-3.xenial+1_amd64.deb, which has
$ dpkg-deb -I ~/Downloads/postgresql-10-pglogical_2.2.0-3.xenial+1_amd64.deb
...
Depends: libc6 (>= 2.4), libpq5 (>= 9.1~), postgresql-10 (>= 10.5)
...
and was built against Pg 10.5.
Or rather, the 9.5 equivalent pkg.
@ChristophBerg Are you using PGDG postgres, or debian postgres? 2ndQ-packaged pglogical or Debian packaged pglogical?
(apologies if this should be obvious from context, struggling for time)
I have installed latest pglogical:
root@hostname:~# dpkg -l postgresql-10
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================-===================-===================-=============================================================
ii postgresql-10 10.5-1.pgdg16.04+1 amd64 object-relational SQL database, version 10 server
root@hostname:~# dpkg -l postgresql-10-pglogical
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================-===================-===================-=============================================================
ii postgresql-10-pglogical 2.2.0-3.xenial+1 amd64 pglogical plugin for PostgreSQL 10
Still I'm getting these on standby:
2018-08-28 07:38:50.778 UTC [10443] [unknown]@database LOG: starting apply for subscription subscription
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database ERROR: no data left in message
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database LOG: apply worker [10443] at slot 1 generation 16 exiting with error
2018-08-28 07:38:50.792 UTC [10179] LOG: worker process: pglogical apply 16384:2875150205 (PID 10443) exited with exit code 1
and these on master:
2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,1,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"starting logical decoding for slot ""pgl_database_provider_subscription""","streaming transactions committing after 6C5/C36401A8, reading WAL from 6C5/C3630C78",,,,,,,,"subscription"
2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,2,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found initial starting point at 6C5/C3630C78","Waiting for transactions (approximately 1) older than 611497951 to end.",,,,,,,,"subscription"
2018-08-28 07:38:50.787 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,3,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found consistent point at 6C5/C3631808","There are no running transactions.",,,,,,,,"subscription"
2018-08-28 07:38:50.792 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,4,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,5,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08P01,"unexpected EOF on standby connection",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,6,"idle",2018-08-28 07:38:50 UTC,,0,LOG,00000,"disconnection: session time: 0:00:00.011 user=pglogical database=database host=192.168.0.100 port=44908",,,,,,,,,"subscription"
On Tue, 28 Aug 2018 at 04:43, raiviskrumins notifications@github.com wrote:
Still I'm getting these on standby:
2018-08-28 07:38:50.778 UTC [10443] [unknown]@database LOG: starting apply for subscription subscription
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database ERROR: no data left in message
2018-08-28 07:38:50.791 UTC [10443] [unknown]@database LOG: apply worker [10443] at slot 1 generation 16 exiting with error
2018-08-28 07:38:50.792 UTC [10179] LOG: worker process: pglogical apply 16384:2875150205 (PID 10443) exited with exit code 1
and these on master:
2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,1,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"starting logical decoding for slot ""pgl_database_provider_subscription""","streaming transactions committing after 6C5/C36401A8, reading WAL from 6C5/C3630C78",,,,,,,,"subscription"
2018-08-28 07:38:50.786 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,2,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found initial starting point at 6C5/C3630C78","Waiting for transactions (approximately 1) older than 611497951 to end.",,,,,,,,"subscription"
2018-08-28 07:38:50.787 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,3,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,00000,"logical decoding found consistent point at 6C5/C3631808","There are no running transactions.",,,,,,,,"subscription"
2018-08-28 07:38:50.792 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,4,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08006,"could not receive data from client: Connection reset by peer",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,5,"idle",2018-08-28 07:38:50 UTC,12/0,0,LOG,08P01,"unexpected EOF on standby connection",,,,,,,,,"subscription"
2018-08-28 07:38:50.793 UTC,"pglogical","database",4212,"192.168.0.100:44908",5b84fc0a.1074,6,"idle",2018-08-28 07:38:50 UTC,,0,LOG,00000,"disconnection: session time: 0:00:00.011 user=pglogical database=database host=192.168.0.100 port=44908",,,,,,,,,"subscripti
Did you restart the provider and subscriber? Are both nodes running upgraded versions of Postgres and pglogical?
I had the same errors when trying to set up replication from PostgreSQL v9.6.10 to v10.5. After installing the new pglogical packages from https://dl.2ndquadrant.com/default/release/browse/apt/pool/main/p/pglogical/ for both PostgreSQL versions and restarting both clusters, replication started to work again.
I'm using PostgreSQL packages from the PGDG repository on Debian Stretch and installed the "stretch" versions of the pglogical packages.
So, after updating the packages and restarting postgres on both sides, it worked for me also. pglogical is loaded via shared_preload_libraries, so you have to restart postgres after an update. lol
@greigwise Yes, it worked for me as well. Thank you!
The backstory here is that commit f49a80c48 on PostgreSQL master accidentally broke the binary compatibility of the layout of struct ReorderBufferTXN as part of fixing a couple of bugs. Since it was a bug fix, it was backported. The ABI change didn't get noticed, so the change landed in releases 10.5, 9.6.10, 9.5.14, 9.4.19 and 9.3.24, breaking the ABI for logical decoding output plugins.
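To illustrate why a struct layout change breaks an already-compiled plugin, here is a small Python sketch using `ctypes`. The two structs and their field names are hypothetical stand-ins, not the real ReorderBufferTXN definition; the point is only that a field added in the middle shifts the offsets of everything after it.

```python
import ctypes

class TxnOldLayout(ctypes.Structure):
    # Hypothetical layout the plugin was compiled against.
    _fields_ = [
        ("xid", ctypes.c_uint32),
        ("nentries", ctypes.c_uint64),
    ]

class TxnNewLayout(ctypes.Structure):
    # Hypothetical layout after a fix inserted a field in the middle.
    _fields_ = [
        ("xid", ctypes.c_uint32),
        ("added_field", ctypes.c_uint64),
        ("nentries", ctypes.c_uint64),
    ]

# The "server" fills in the struct using the new layout.
server_side = TxnNewLayout(xid=42, added_field=0xDEADBEEF, nentries=7)

# The "plugin" reinterprets the same bytes using the old layout.
plugin_view = TxnOldLayout.from_buffer_copy(bytes(server_side))

print("offset of nentries, old layout:", TxnOldLayout.nentries.offset)
print("offset of nentries, new layout:", TxnNewLayout.nentries.offset)
print("server wrote nentries =", server_side.nentries)
print("plugin reads nentries =", hex(plugin_view.nentries))
```

The plugin still reads `nentries` at the old offset, so it picks up the bytes of the newly added field instead. No error is raised at that point, which is why the failure only shows up later as confusing symptoms.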
There's discussion in the PostgreSQL infrastructure team about whether it is feasible to add ABI checking to the buildfarm, and there will soon be some discussion on pgsql-hackers about how to avoid this in future too. PostgreSQL tries extremely hard to keep patch releases backward compatible and very safe to update to, so changes will be made to stop it happening again.
This means the issue affects any other logical decoding output plugin like wal2json etc. too, but not pgoutput or test_decoding, since they're built as part of PostgreSQL itself.
We addressed this for pglogical by updating the packaging to add a new dependency on the post-ABI-break minor release. So we ensure we only build against that release or later and we only install against that release or later releases. It forces people to update, but they should anyway, and it's a lot safer than runtime attempts to compensate for struct layout changes.
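If you want to confirm that a downloaded package carries the new dependency, the `Depends:` line printed by `dpkg-deb -I` can be checked with a few lines of Python. This is a rough sketch: the helper name is hypothetical, and it only handles simple `>=` constraints like the ones shown earlier in this thread.

```python
import re

def min_required_version(depends_line, package):
    """Return the '>=' version required for `package`, or None if absent."""
    if depends_line.startswith("Depends:"):
        depends_line = depends_line[len("Depends:"):]
    for entry in depends_line.split(","):
        match = re.fullmatch(r"\s*(\S+)\s*\(>=\s*([^)\s]+)\s*\)\s*", entry)
        if match and match.group(1) == package:
            return match.group(2)
    return None

# Depends line as shown by dpkg-deb -I for the 2.2.0-3 xenial build.
line = "Depends: libc6 (>= 2.4), libpq5 (>= 9.1~), postgresql-10 (>= 10.5)"
print(min_required_version(line, "postgresql-10"))
```

A result of `10.5` or later for the `postgresql-10` entry suggests the package was built after the packaging fix; `None` means no such constraint was found on that line.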
@ringerc: PGDG packages, this is the apt.postgresql.org buildd. @alvherre: tmp_check/log: has this:
psql:t/basic.sql:21: ERROR: 42883: function pg_current_xlog_location() does not exist
LINE 1: SELECT pg_xlog_wait_remote_apply(pg_current_xlog_location(),...
(I'll leave the debugging to @mnencia, he knows the packaging of this package much better than I do.)
@ChristophBerg That looks like it's not properly handling the renaming of pg_current_xlog_location to pg_current_wal_lsn in Pg10. Likely unrelated.
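For test scripts that have to run against both old and new servers, one option is to pick the function name from the server's numeric version. A minimal Python sketch, assuming you already have the integer reported by `SHOW server_version_num` (the helper name is hypothetical, and this is not how the pglogical test suite does it):

```python
def current_wal_position_function(server_version_num):
    """Name of the function returning the current WAL write position.

    PostgreSQL 10 renamed pg_current_xlog_location() to pg_current_wal_lsn().
    server_version_num is 100005 for 10.5 and 90610 for 9.6.10.
    """
    if server_version_num >= 100000:
        return "pg_current_wal_lsn"
    return "pg_current_xlog_location"

for version_num in (90610, 100005):
    print(version_num, "->", f"SELECT {current_wal_position_function(version_num)}();")
```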
After upgrading to PostgreSQL 10.5
PostgreSQL 10.5 (Debian 10.5-1.pgdg90+1)
replication starts failing with ERROR: no data left in message. It doesn't happen straight after the update, but only once there is a need for conflict resolution (set to apply remote in our case).