confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0
2 stars 149 forks source link

"flush" wal_pos reported in keepalive unless we have pending transactions #73

Closed samstokes closed 8 years ago

samstokes commented 8 years ago

The server may generate WAL that doesn't correspond to a transaction commit, e.g.:

Currently we only update our "flush lsn" when we see a COMMIT, so if enough of this non-committed-transaction activity occurs before we see a COMMIT, the server may run out of WAL space and crash. I believe this is the root cause of #68.

The server's regular keepalive messages are already informing us of the updated wal_pos resulting from this activity. So this commit makes us acknowledge it as "flushed" (even though we didn't actually do anything about it), provided that we aren't currently waiting for Kafka to acknowledge a transaction in flight (in which case acknowledging the subsequent activity would could result in us losing the transaction in flight, in case of a restart).

I've also tested the case where Bottled Water restarts before the transaction is committed, and it does pick up from the right point and capture the whole transaction.

In the process of debugging this I ran into a bug in the client's logic for keeping track of transactions in flight, where the client would acknowledge each commit twice (each time it processed a new COMMIT, it would first reprocess the COMMIT of the previous transaction). I think the bug was harmless, but it made the logic hard to reason about, since effectively it was processing records since "removed" from the circular buffer abstraction. Since this change allows keepalives during that interval to update the fsync_lsn, it would also trigger the "Commits not in WAL order!" warning each time a transaction came in.

To work around that, I also refactored the circular buffer of transactions to remove some special cases in the handling of the initial snapshot transaction and distinguish more clearly between the buffer being empty and full.

I initially planned to write automated tests for this behaviour by querying pg_replication_slots.restart_lsn; however, the very async nature of the communication made it difficult to make the tests pass without large (20 second) sleeps - Bottled Water only checkpoints every 10 seconds, Postgres only issues keepalives once per checkpoint_timeout, etc. So right now there is no additional test coverage. It passes the existing tests, though :)

samstokes commented 8 years ago

I'm going to merge this to make progress, since it fixes a known and serious bug and has no observed deleterious effects. Still interested in any feedback (/ping @ept if you get a spare minute :)).