confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0
3 stars 149 forks source link

Design for adding to an existing database #99

Open samstokes opened 8 years ago

samstokes commented 8 years ago

Right now a few things make it tricky to trial Bottled Water on an existing database taking production traffic:

A related problem is what to do if Bottled Water is running on a primary server that fails, and we promote a standby to become the new primary. To keep the Kafka updates flowing, we probably want to point Bottled Water at the new primary, but:

This encompasses a lot of individual changes we might make (e.g. add a table whitelist, add a --skip-snapshot flag, integrate Zookeeper support so Bottled Water can create its own Kafka topics, etc), but it would be useful to have a plan for how this ought to work. Creating this issue to collect discussion about this topic!

badboyd commented 8 years ago

The initial snapshot does a SELECT * from all tables, which will probably do bad things to the database's cache unless the entire dataset (plus indices) can fit in RAM;

It can be worse, It can lock other transactions. BW do the snapshot like this SELECT row by row from all tables and non-stop. So I have a proposal for BW snapshot: fetch number of rows( for example 5000) send all to kafka, then wait for ACK then fetch next 5000 rows ? OR Can we do the snapshot on a standby server ? What do you think @samstokes ?

badboyd commented 8 years ago

101

samstokes commented 8 years ago

It can be worse, It can lock other transactions. BW do the snapshot like this SELECT row by row from all tables and non-stop. So I have a proposal for BW snapshot: fetch number of rows( for example 5000) send all to kafka, then wait for ACK then fetch next 5000 rows ? OR Can we do the snapshot on a standby server ?

Are you sure that the SELECT can block other transactions? My understanding was that Postgres' MVCC design meant that queries running on a consistent snapshot (at the REPEATABLE READ isolation level) would execute without blocking other transactions.

badboyd commented 8 years ago

Are you sure that the SELECT can block other transactions? My understanding was that Postgres' MVCC design meant that queries running on a consistent snapshot (at the REPEATABLE READ isolation level) would execute without blocking other transactions.

Yeah, It can not block other transactions according to Doc. But, when I start BW snapshot, other transactions start executing for much much more longer time than usual, some transaction cannot execute.

samstokes commented 8 years ago

I guess that makes sense - the long-running SELECT will cause a lot of CPU and I/O, and lots of table bloat if there is high write volume, both of which could slow down other queries.

some transaction cannot execute.

Do you mean that you observe some transactions aborting with an error? Do you see any specific error message, or are they just timing out?

I suppose to support snapshots of large databases we might need to either run the snapshot on a standby like you suggest, or teach Bottled Water how to restore Postgres backups directly.

samstokes commented 8 years ago

Unfortunately, according to this blog post, it doesn't look like there's currently a way to have a logical replication consumer like Bottled Water start consuming from a specific point in the past, which we'd need to avoid missing updates after either bootstrapping from a standby server or failing over to a promoted standby.