MeltanoLabs / tap-postgres

Singer Tap for PostgreSQL
https://hub.meltano.com/extractors/tap-postgres--meltanolabs/

Document potential WAL issues #402

Open visch opened 6 months ago

visch commented 6 months ago

From this slack thread https://meltano.slack.com/archives/C069CQNHDNF/p1711361244627669

We want to document the manual steps for debugging the LSN. Even though this wasn't with our tap, we also utilize wal2json, so it should be pretty similar to pipelinewise's implementation.

Hello 
[@Ryan](https://meltano.slack.com/team/U06C5H3MXE2)
. Whenever that happened to us, it has always been that the Postgres instance was having (a lot of) trouble allocating resources to deliver WAL messages, either because the database was busy or because there were very big transactions, and loading those requires a lot of resources.
That got confirmed in the end by manually logging in to the Postgres instance and using the function pg_logical_slot_peek_binary_changes(). Even setting the upto_lsn to 1 led to the function taking about 45 minutes to start showing content, when it is usually immediate.
In our case, we believe a very big transaction was causing the problem in rendering some messages; once those messages were consumed, everything went back to normal.
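As a rough sketch of the peek described above (the slot name `tap_postgres` is a placeholder; substitute your own slot). Peeking does not consume changes, so it is safe to run against a live slot; if even a single change takes a very long time to appear, the bottleneck is the database decoding the WAL, not the pipeline:

```sql
-- Peek at (without consuming) at most one pending change on the slot.
-- Assumes a logical replication slot named 'tap_postgres'; replace with yours.
SELECT lsn, xid
FROM pg_logical_slot_peek_binary_changes('tap_postgres', NULL, 1);
```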

Ryan
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711361942642209?thread_ts=1711361244.627669&cid=C069CQNHDNF)
Hi 
[@Raül Bonet](https://meltano.slack.com/team/U06QA67NC9J)
! Appreciate the response! Did you guys do something to the state (or config), or just wait? If it's the latter, this one has been stuck on "Waiting for first wal message" since last Friday.

Ryan
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711361984695879?thread_ts=1711361244.627669&cid=C069CQNHDNF)
Peeking at the changes, though, seems like an interesting approach to debug this further. I'll check that!

Raül Bonet
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711362241081629?thread_ts=1711361244.627669&cid=C069CQNHDNF)
No problem, happy to try to save you the nightmare that debugging this was for us, haha.
The only thing we did manually was unsetting a config parameter that made the pipeline stop after some time (max_run_seconds, I think it is). Otherwise the pipeline would stop and start again and all the progress was lost.
Apart from that, yes, just wait, but in our case it was maybe 1 hour maximum...
Other things you can check: at the beginning of the execution, the pipeline says something like "requesting WAL messages up to LSN something". You can query pg_replication_slots to see the confirmed_flush_lsn and the state of the slot.
But yeah, definitely try peeking at changes! If that also takes forever, the problem is in the database itself, not your pipeline.
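A sketch of that slot check, again with a placeholder slot name; `confirmed_flush_lsn` is the last position the consumer acknowledged, and comparing it to the server's current WAL position gives a rough lag estimate:

```sql
-- Inspect the replication slot: is it active, and how far behind is it?
-- Assumes a slot named 'tap_postgres'; replace with your slot name.
SELECT slot_name,
       plugin,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_name = 'tap_postgres';
```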

Ryan
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711367195713629?thread_ts=1711361244.627669&cid=C069CQNHDNF)
you’re a lifesaver, man! In our case it was not max_run_seconds causing the pipeline to be cut off (it was already at 3600 seconds), but logical_poll_total_seconds, which was set to 120 seconds; and it was also a case of a heavy transaction.

Derek Visch (AutoIDM)
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711370669728349?thread_ts=1711361244.627669&cid=C069CQNHDNF)
Great info here, just wanted to say that's amazing. 
[@Raül Bonet](https://meltano.slack.com/team/U06QA67NC9J)
would it be OK for me to capture this and throw it in the readme for MeltanoLabs' tap-postgres? I know it's not the same tap, but I'm pretty sure we'd hit the same thing too!

Derek Visch (AutoIDM)
  [10 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711370713454629?thread_ts=1711361244.627669&cid=C069CQNHDNF)
I wonder if it's caused by the wal2json plugin. I have some consternation about that plugin myself, but at the end of the day I don't want to run a different one :confused: so we used the same thing.

Raül Bonet
  [5 days ago](https://meltano.slack.com/archives/C069CQNHDNF/p1711819229561179?thread_ts=1711361244.627669&cid=C069CQNHDNF)
Hello 
[@Derek Visch (AutoIDM)](https://meltano.slack.com/team/U06CCB0EUBC)

Sorry, I have been sick and not able to follow up on this!
Yes please. I must say a huge part of the merit goes to
[@maarten_van_gijssel](https://meltano.slack.com/team/U06C942PYLV)
(GitHub [here](https://github.com/vgijssel)), my tech lead at HackerOne with whom we debugged this (in case you want to acknowledge somebody :wink:).
About wal2json: we had the same theory. The version of the tap that we are using does not leverage the write-in-chunks option either, which probably only makes things worse.
I am very curious about your reservations and about why you don't want to run a different one. We were concerned about its efficiency in the decoding process; in the end, a binary format like pgoutput should be faster. Could you elaborate on your reasons?
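For reference, wal2json options such as `write-in-chunks` can also be passed through the variadic options of the peek functions when experimenting, e.g. (placeholder slot name, assuming the slot was created with wal2json):

```sql
-- Peek at one change with wal2json's write-in-chunks enabled,
-- to see whether chunked output changes how quickly data starts streaming.
-- Assumes a wal2json slot named 'tap_postgres'.
SELECT data
FROM pg_logical_slot_peek_changes('tap_postgres', NULL, 1, 'write-in-chunks', '1');
```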