Closed by alco 2 weeks ago
Could the bug be on our side?
Postgres official documentation explicitly states that LSNs grow monotonically:
WAL records are appended to the WAL files as each new record is written. The insert position is described by a Log Sequence Number (LSN) that is a byte offset into the WAL, increasing monotonically with each new record.
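For anyone comparing LSNs by hand while debugging this: the textual form Postgres prints (for example `16/B374D848`) is two hexadecimal halves of a 64-bit byte offset, so comparing the strings directly can give the wrong order. A minimal sketch of converting them to integers first; the function names `parse_lsn` and `format_lsn` are made up for illustration:

```python
def parse_lsn(text: str) -> int:
    """Convert the textual LSN form 'HIGH/LOW' (both hex) to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value: int) -> str:
    """Inverse of parse_lsn: render a 64-bit offset as 'HIGH/LOW' in uppercase hex."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


# As strings, "0/A" sorts after "0/10"; as byte offsets it is the smaller one.
assert "0/A" > "0/10"
assert parse_lsn("0/A") < parse_lsn("0/10")
```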
@kevin-dp There's probably a better title for this issue but I couldn't come up with one.
The bug lies in the assumption we make about LSNs of successive logical messages we receive from Postgres. Postgres definitely writes to WAL using monotonically increasing LSNs by definition, but that doesn't contradict the possibility of it being more lax when streaming logical messages to a replica. For example, Relation and Type messages aren't read directly from the primary's WAL but are generated on the fly.
There's something in the way that we store the latest seen LSN and how Postgres resumes a previously interrupted replication stream from the replication slot that leads to the conflict described in this issue.
Last time I checked, the Postgres documentation on this was not clear at all, and I ran into the same issue. The only real source of info for this is the Postgres source code.
In the original WAL, messages from concurrent transactions are interleaved. Then with logical replication, messages are re-ordered to group per transaction. This means that:
You can reproduce this by writing to Postgres using multiple transactions concurrently.
You can see how we handle this in PowerSync here.
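To make the re-ordering concrete, here is a toy model of it. This is a simplified sketch and not the actual replication protocol: the `wal` list, the transaction ids and the LSN values are all invented, and `decode` only mimics the "buffer per transaction, emit at commit" behaviour described above.

```python
from collections import defaultdict

# Hypothetical WAL contents as (lsn, xid, kind). LSNs strictly increase in
# write order, and records of two concurrent transactions are interleaved.
wal = [
    (100, 1, "begin"),
    (110, 2, "begin"),
    (120, 1, "insert"),
    (130, 2, "insert"),
    (140, 2, "commit"),
    (150, 1, "insert"),
    (160, 1, "commit"),
]


def decode(records):
    """Buffer records per transaction and emit each transaction whole at its commit."""
    pending = defaultdict(list)
    stream = []
    for lsn, xid, kind in records:
        pending[xid].append((lsn, xid, kind))
        if kind == "commit":
            stream.extend(pending.pop(xid))
    return stream


stream = decode(wal)
lsns = [lsn for lsn, _, _ in stream]
commit_lsns = [lsn for lsn, _, kind in stream if kind == "commit"]

print(lsns)         # [110, 130, 140, 100, 120, 150, 160]
print(commit_lsns)  # [140, 160]
```

Transaction 2 commits first, so all of its records are emitted before those of transaction 1, even though transaction 1 started earlier. The stream therefore drops from LSN 140 back to 100, while the commit LSNs on their own still only go up.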
@rkistner Brilliant insight and at the right time. Thanks!
I've rethought our approach to keeping track of processed LSNs and reporting them back to Postgres.
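As a rough illustration of the idea (not the code that went into the sync service), one way to track progress is to ignore per-message LSNs and advance the reported position only when a commit arrives. `LsnTracker`, `on_message` and the message tuples below are hypothetical names, and the sketch assumes transactions arrive in commit order:

```python
class LsnTracker:
    """Tracks the position considered safe to report back to Postgres (hypothetical helper)."""

    def __init__(self) -> None:
        self.flushed_lsn = 0

    def on_message(self, kind: str, lsn: int) -> None:
        # Messages inside a transaction may carry LSNs lower than ones already
        # seen, so only a commit is allowed to move the reported position.
        if kind == "commit":
            # max() is a defensive guard; commit LSNs are assumed not to go backwards.
            self.flushed_lsn = max(self.flushed_lsn, lsn)


tracker = LsnTracker()
# Invented stream in which the second transaction starts at a lower LSN
# than the commit of the first one.
for kind, lsn in [("begin", 110), ("insert", 130), ("commit", 140),
                  ("begin", 100), ("insert", 120), ("commit", 160)]:
    tracker.on_message(kind, lsn)

print(tracker.flushed_lsn)  # 160
```

With this shape, a message whose LSN is lower than the previous one is no longer treated as an error, since nothing compares the LSNs of individual messages.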
We see occasional test failures on CI caused by receiving a logical message with an LSN that is lower than the LSN of the previously received message:
I've also seen it happen once during normal operation of the sync service:
My attempts to reproduce it have been unsuccessful. Further investigation is needed to determine the conditions that lead to this failure.