cpursley / walex

Postgres change events (CDC) in Elixir
MIT License

progress tracking for keep alive reply #64

Closed DaemonSnake closed 3 months ago

DaemonSnake commented 3 months ago

context:

When receiving a keep-alive request from Postgres, replying with Postgres' current wal_end + 1 tells the Postgres server that we have fully processed everything up to its wal_end + 1.
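To make the message concrete, here is a minimal sketch of decoding the primary keep-alive as described in the Postgres streaming replication protocol ('k', WAL end, server clock, reply-requested flag). The module and function names are hypothetical and not taken from WalEx.

```elixir
defmodule KeepAliveSketch do
  # Hypothetical helper, not part of WalEx.
  # Primary keepalive layout: Byte1('k'), Int64 wal_end, Int64 server clock,
  # Byte1 reply flag (1 means the server wants a reply as soon as possible).
  def parse(<<?k, wal_end::64, _server_clock::64, reply::8>>) do
    {:ok, %{wal_end: wal_end, reply_requested: reply == 1}}
  end

  def parse(_other), do: :error

  # The value currently sent back for all three LSN fields of the reply.
  def current_ack(%{wal_end: wal_end}), do: wal_end + 1
end
```

With this, `KeepAliveSketch.parse(<<?k, 100::64, 0::64, 1>>)` yields a map whose `wal_end` is 100, and `current_ack/1` on that map gives 101.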

issue:

As Replication.Server communicates with Replication.Publisher asynchronously, this means that, if an error occurs during the processing of a message, we can acknowledge many more messages than we actually processed.

For a durable slot, this means that when the slot restarts, it will do so at the last wal_end+1 we replied with, losing events. The longer a transaction takes to be fully processed (many records, etc.), the higher the risk of this happening.

solution proposed:

Add a Replication.Progress Agent that stores the LSNs of in-progress transactions in a :gb_sets (ordered set). When we start receiving a transaction we push its LSN, and we drop it when processing is done. In Replication.Server, we then only need to use the wal_end of the smallest LSN in progress as the keep-alive reply. If no transaction is in progress, we can return the received wal_end+1, as we do currently.
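A minimal sketch of that idea, assuming the Agent holds plain integer LSNs. The module name `ProgressSketch` and the functions `begin/2`, `done/2` and `reply_lsn/2` are hypothetical and are not the actual code from this PR.

```elixir
defmodule ProgressSketch do
  use Agent

  # State is an ordered set of LSNs for transactions still being processed.
  def start_link(_opts \\ []), do: Agent.start_link(fn -> :gb_sets.new() end)

  # Called when a transaction starts arriving.
  def begin(pid, lsn), do: Agent.update(pid, &:gb_sets.add(lsn, &1))

  # Called once the transaction has been fully processed.
  def done(pid, lsn), do: Agent.update(pid, &:gb_sets.delete_any(lsn, &1))

  # LSN to acknowledge: the oldest in-progress LSN if there is one,
  # otherwise the received wal_end + 1 (the current behaviour).
  def reply_lsn(pid, wal_end) do
    Agent.get(pid, fn set ->
      if :gb_sets.is_empty(set), do: wal_end + 1, else: :gb_sets.smallest(set)
    end)
  end
end
```

Since the set is ordered, the smallest element is always the oldest unfinished transaction, so the acknowledged position never moves past work that has not completed.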

DaemonSnake commented 3 months ago

oh, I forgot that the tests use Registry.child_spec() from the other PR

cpursley commented 3 months ago

This is a good idea, what else is needed? I just merged your other small tweak branch.

DaemonSnake commented 3 months ago

oh sorry, I thought I had sent a comment with the closure of the PR

There are a few things where I'm not certain enough of the outcome. I think it might be wrong to send the same value three times.

Int64 -> The location of the last WAL byte + 1 received and written to disk in the standby.
Int64 -> The location of the last WAL byte + 1 flushed to disk in the standby.
Int64 -> The location of the last WAL byte + 1 applied in the standby.

It's not that clear, but I'm afraid that replying with the current wal_end + 1 for the first field will cause Postgres to re-send packets. I need to investigate further.
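For reference while investigating, here is a sketch of a standby status update encoder that takes the three positions as separate arguments instead of repeating one value. The layout ('r', three Int64 positions, client clock in microseconds since 2000-01-01, reply flag) follows the Postgres protocol documentation; the module name is hypothetical, and which values are correct for each field is exactly the open question above.

```elixir
defmodule StatusUpdateSketch do
  # Microseconds between the Unix epoch and 2000-01-01 00:00:00 UTC,
  # the epoch Postgres uses for replication timestamps.
  @pg_epoch_offset_us 946_684_800_000_000

  # Hypothetical helper: written, flushed and applied are passed separately
  # so that each field can carry a different position if needed.
  def encode(written, flushed, applied, now_us \\ System.os_time(:microsecond)) do
    clock = now_us - @pg_epoch_offset_us
    # Trailing 0 means we do not ask the server for an immediate reply.
    <<?r, written::64, flushed::64, applied::64, clock::64, 0>>
  end
end
```

Calling `StatusUpdateSketch.encode(lsn, lsn, lsn)` reproduces sending the same value three times, while passing distinct values makes it possible to test how the server reacts to each field.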

cpursley commented 3 months ago

I see. Do you mind opening a Discussion on this topic? Maybe others have some insights.