confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0
3 stars 149 forks source link

Optionally specify sort order for initial snapshot #110

Open kinghuang opened 8 years ago

kinghuang commented 8 years ago

The initial database snapshot does a SELECT * which returns rows in an unspecified order. This can cause the initial batch of messages in Kafka to appear "out of order" compared to when the rows were inserted in the database.

I'm trying to use Bottled Water to capture changes in a table that contains log messages, which have an order to them. The potentially jumbled order of the initial snapshot makes the initial messages not very useful. In my case, simply sorting on the primary key would be sufficient.

samstokes commented 8 years ago

@kinghuang that's an interesting use case. Roughly how big (bytes or rows) is that table? Is the primary key an autoincrementing integer, or are you using a timestamp as the primary key? How does the volume of writes compare to the volume of reads?

I can see why you'd want to preserve the order of logs, but in general relational databases don't make many guarantees about ordering, so they're not really optimised for that. Given Kafka is pretty optimised for order-preserving delivery of messages, is there a reason you don't want to write the log messages directly to Kafka?

The main concern I'd have is that the initial snapshot is already an expensive operation, and adding an ORDER BY would make it more so. I'm not an expert on Postgres query optimisation, but I'm pretty sure even an index scan is more expensive than a simple sequential scan through the table. (I did a very quick microbenchmark suggesting it was about 50-100% slower.)

(The configuration for which ORDER BY clause to use for which table could also get pretty complex. Bottled Water already knows which columns are present in the primary key, so it could try and default to the primary key for each table; but the primary key wouldn't in general be the right thing to order by.)

kinghuang commented 8 years ago

Roughly how big (bytes or rows) is that table? Is the primary key an autoincrementing integer, or are you using a timestamp as the primary key? How does the volume of writes compare to the volume of reads?

There's two tables in a sample dump of an application's database that I'm working with right now. Table A has 37000 rows; table B has 638000 rows. The primary key is an autoincrementing integer in both tables. They contain activity logs from an application, and have very high write but low read volumes. The numbers will be much higher in the production system.

…, is there a reason you don't want to write the log messages directly to Kafka?

I very much wish that were possible! :) This is a third-party application. It doesn't offer any choices on where it writes these particular logs. We're processing these tables to feed a real-time event system.

One thing that occurred to me after I opened this issue is that I'm running into this problem because I'm processing existing data for use during development. Bottled Water works fine for new rows as they're written when the application is running.

That said, my expectations when I first tried this was that the messages in Kafka would reflect the order that the rows were written. This has held true for table A, but not table B. So this is a specific use case where the table rows have a natural order to them, and I'd like that reflected in the initial snapshot that Bottled Water makes.