confluentinc / bottledwater-pg

Change data capture from PostgreSQL into Kafka
http://blog.confluent.io/2015/04/23/bottled-water-real-time-integration-of-postgresql-and-kafka/
Apache License 2.0
3 stars 149 forks source link

Same table names in different schemas of the same database #93

Closed mcapitanio closed 8 years ago

mcapitanio commented 8 years ago

Hello,

maybe this is a silly question, but it seems to me that now BottledWater keeps information about the Postgres schema just in the namespace of the key and value of Avro, so for example for the same table named "table_1" belonging both to a schema "schema_1" and to a schema "schema_2" of a same database "database" (and maybe with different schema structure) what would happen? The same topic would be used for both the tables?

Could make sense to incorporate the source Postgres schema name in the name of the target topic of BottledWater? I see that there is already a "--topic-prefix" option but it has a database scope. BottledWater could automatically create a topic named:

      [topic-prefix]-<schema>-<table>

It would also be useful to make the topic separator a further option "--topic-separator". Some tools (ie. PrestoDB) reads the "." as a separator to identify the database name and the topic created by BottledWater could be separated by a "." instead of a "-".

samstokes commented 8 years ago

Hi @mcapitanio, I think you are right that the current behaviour would be that updates from both tables would go to the same topic.

I'm not too familiar with Postgres schemas or how they're generally used. I'd be interested to learn more about your use case for having the different schemas, if you're able to share.

Your proposal to include the schema as a prefix in the topic name makes sense (probably omitting the prefix for the default public schema). We'd welcome a pull request to add this, and happy to advise. Currently the on_table_schema callback in bottledwater.c is where we pull out the Avro schema name and set it as the topic name - that would be a good place to also pull out the Postgres schema name from the Avro schema namespace, and construct the qualified name for use as the topic name.

(Side note: the terminology gets pretty confusing: we've got the Postgres schema, the table schema, and the Avro schema, and to make it even more confusing we're using the Avro schema namespace name to store the Postgres schema name :))

mcapitanio commented 8 years ago

Hi @samstokes , yeah, we should use prefix also in the terminology! :)

We are familiar with the concept of database-schemas with Oracle database, but it seems to me that the use case we adopted with Oracle db instances could be adopted also with Postgres ones. This use case for us is using different oracle-database-schema for different application, so each application has its own database-schema. With a weak data governance could happen that different applications, maybe developed by different teams without a shared phisical data model (maybe because they don't need one, not only because of a weak data governance...), could create in their own postgres-database-schema, living in the same postgres-database, tables with the same name.

So, I'll take a look at the bottledwater.c class and I'll be glad to open a new pull request on this, hopefully in a short time.

mcapitanio commented 8 years ago

Hi @samstokes ,

I have just pulled #102 to integrate the original Postgres table schema name in the target Kafka topic name.

Of course every kind of comment and advise from you is really greatly appreciated and wellcome!

samstokes commented 8 years ago

Thank you very much for the pull request! I'll try to take a look and respond with comments in the next couple of days.

deem0n commented 8 years ago

Hello guys!

PotgreSQL schema prefix is absolutely required for any large scale project and I greatly appreciate PR #102 !

Confusingly enough, PostgreSQL schema is equivalent of MySQL database. But for the purpose of the discussion it is just table prefix in SQL in both cases. Oracle user is the same concept which is also confusing.

The good news is that all databases use same SQL syntax schema_name.table and that is all we need to know to support it in Kafka topics. Way to get schema_name from database WAL logs is database specific though.

mcapitanio commented 8 years ago

Bello,

yes, totally agree, pratically a schema it's just a table prefix.

One further and very little enhancement could be change the BottleWater constant which represents the prefix separator from "-" to "." so to have for Kafka topics the same "naming convention" there is in the database, so:

.. where is the original BottledWater -p option which could be used in general to distinguish among different database name instances.
samstokes commented 8 years ago

Thanks a lot @mcapitanio and @deem0n for explaining the use cases for schemas. My usage of Postgres has always been in a cloud environment where each app gets its own Postgres database, but I can totally see how schemas would be essential if you have multiple apps sharing the same database.