confluentinc / kafka-connect-jdbc

Kafka Connect connector for JDBC-compatible databases

Support for uuid type in postgres #81

Open cddr opened 8 years ago

cddr commented 8 years ago

Tables with columns of type UUID in postgres cannot be "connected" by just listing the table in the whitelist because the UUID type is not handled by the converter.

One workaround is to use a SQL query instead and convert UUIDs to strings, but obviously this isn't ideal.

xiongtx commented 7 years ago

One workaround is to use a query that converts the UUID to another type, e.g. a VARCHAR:

In Postgres:

SELECT
    test.id::varchar,
...
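
For reference, a cast query like this can be wired into the connector's query mode. A minimal sketch, assuming a hypothetical connector name, database, and test table (connector.class, connection.url, mode, query, and topic.prefix are standard JDBC source connector settings):

{
  "name": "postgres-uuid-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/mydb",
    "mode": "bulk",
    "query": "SELECT test.id::varchar AS id, test.payload FROM test",
    "topic.prefix": "test-uuid"
  }
}
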
cddr commented 7 years ago

Another option (especially if you have a lot of tables that use unsupported types) is to create a "kafka_connect" schema or something, and use it to house views that do the conversion described above (along with conversions for all the other "exotic" datatypes that remain unsupported, like boolean and json).
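
A minimal sketch of that approach in Postgres, assuming a hypothetical orders table with uuid, boolean, and json columns (the schema, view, and column names are just examples):

CREATE SCHEMA IF NOT EXISTS kafka_connect;

-- Expose unsupported column types as text/int so the connector can handle them.
CREATE VIEW kafka_connect.orders AS
SELECT
    id::varchar        AS id,
    deleted::int       AS deleted,
    request_info::text AS request_info
FROM public.orders;

The connector would then be pointed at the kafka_connect schema instead of the base tables.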

The root problem is that many of these types look the same as far as JDBC is concerned. They all come through as Types.OTHER, so the DataConverter does not have enough information to know how to handle them. Obviously many different types might show up under Types.OTHER, so it's not a trivial fix.

Would it be acceptable to add a map to the job config that provided additional information about how to convert columns whose type is unknown to the native data converter?

We could do this at a global level like...

{ "type.converters" => { "id" => OtherToString,
                    "deleted" => OtherToBoolean } }

...and, if for whatever reason "id" is not always a UUID, at a per-table level like...

{ "type.converters" =>
      { "foo_table" => { "id" => OtherToString,
                            "deleted" => OtherToBoolean } },
      {  "bar_table" => { "deleted" => OtherToBoolean,
                            "request_info" => OtherToString } } }

This would also resolve #35

gwenshap commented 7 years ago

If I understand correctly, the discussion is around where/how to add UUID support. Two reasonable work-arounds were suggested above, but I want to discuss a solution that doesn't require changes to Postgres schemas.

There are basically two places where UUID awareness could be added:

  1. Connect data model
  2. JDBC Connector

If we add UUID as a new data type in the Connect data model, we can also modify Converters to support it when serializing/de-serializing. The problem there is that the UUID type isn't all that common: Avro and JSON don't have a type for UUID AFAIK, and neither do many data stores.

The other option is that the JDBC Connector will translate from UUID columns to a data type that Connect supports. We already do it with the JDBC Number type and such (I think). This isn't as generic as updating the Connect API, but I think it is a reasonable solution and shouldn't be difficult to implement (if you know what type the UUID can be stored as).

Pinging @rhauch for a second opinion :)
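
A rough sketch of that connector-side translation in Java; the helper names are invented for illustration and this is not the connector's actual code:

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;

public class OtherTypeHandling {

  // JDBC reports Postgres uuid columns as Types.OTHER; the vendor-specific
  // type name is the only hint about what the column really holds.
  static boolean isUuidColumn(ResultSetMetaData meta, int col) throws SQLException {
    return meta.getColumnType(col) == Types.OTHER
        && "uuid".equalsIgnoreCase(meta.getColumnTypeName(col));
  }

  // Carry the value as a Connect STRING by falling back to toString();
  // the Postgres driver materializes uuid columns as java.util.UUID.
  static String readAsString(ResultSet rs, int col) throws SQLException {
    Object value = rs.getObject(col);
    return value == null ? null : value.toString();
  }
}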

rhauch commented 7 years ago

It is good that there are several workarounds that work with the existing connector. @xiongtx's suggestion to use a query that casts the columns into the correct type is a great non-invasive technique, even if it requires a bit more table-specific configuration than is ideal.

@cddr's suggestion to define a schema in the database for Kafka Connect would probably work great for some folks that have the flexibility to change their database schema. But others will not have that option, and ideally the connector would not require any database changes in order to work properly.

Having mentioned the workarounds, I do think it's important to improve the connector to handle situations like this, and more generally to better handle DBMS-specific characteristics. For example, SQL Server's timestamp field is actually more of a row version number, so it needs to be handled differently than JDBC timestamps. And there are quite a few other examples, including at least one DBMS-specific variation in the current codebase.

I would suggest that we consider introducing a Dialect class that performs all of the data type conversions, quoting, and other logic that might be DBMS-specific. By default the connector would use the generic Dialect, but the connector task could also use the Connection to determine which subclass to use (and this could always be overridden by a configuration setting to allow people to implement their own custom dialects).
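
To make the proposal concrete, here is a hypothetical shape such a Dialect might take; all names are invented for illustration and this is not the connector's actual API:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;

public class GenericDialect {

  // Identifier quoting is one of the DBMS-specific behaviors a dialect owns.
  public String quoteIdentifier(String identifier) {
    return "\"" + identifier + "\"";
  }

  // Generic fallback: read the column value as-is.
  public Object convertColumn(ResultSet rs, int col) throws SQLException {
    return rs.getObject(col);
  }

  // Choose a subclass from the live connection's metadata; a config
  // setting could override this to plug in a custom dialect.
  public static GenericDialect forConnection(Connection conn) throws SQLException {
    String product = conn.getMetaData().getDatabaseProductName();
    return "PostgreSQL".equalsIgnoreCase(product)
        ? new PostgreSqlDialect()
        : new GenericDialect();
  }
}

class PostgreSqlDialect extends GenericDialect {
  // Override only what differs: Postgres reports uuid columns as
  // Types.OTHER, so carry them as strings and delegate everything else.
  @Override
  public Object convertColumn(ResultSet rs, int col) throws SQLException {
    if ("uuid".equalsIgnoreCase(rs.getMetaData().getColumnTypeName(col))) {
      Object value = rs.getObject(col);
      return value == null ? null : value.toString();
    }
    return super.convertColumn(rs, col);
  }
}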

Debezium's connectors have something similar, with a JdbcValueConverters generic utility and concrete subclasses for MySQL and PostgreSQL that override only those methods that are needed. Note how the DBMS-specific subclasses handle DBMS-specific value conversions and schema building, ultimately delegating to the superclass for the typical types. It is important that the Dialect base class be designed to be easily extended without having to duplicate a lot of code; this is why Debezium's classes have so many methods that often have a bit of logic that delegates to other methods.

Taking this approach would require a fair amount of refactoring and effort, but I think it would pay off for this issue and make it far easier to add other DBMS-specific behaviors in the future.

We could add a UUID logical type, but we need to be careful about adding logical types since doing so requires changing every Converter implementation to support the logical types and also potentially other language bindings.

cddr commented 7 years ago

The other option is that the JDBC Connector will translate from UUID columns to a data type that Connect supports. We already do it with the JDBC Number type and such (I think). This isn't as generic as updating the Connect API, but I think it is a reasonable solution and shouldn't be difficult to implement (if you know what type the UUID can be stored as).

This would work for our needs.

The only problem is that, with Postgres at least, uuid columns are reported as Types.OTHER by JDBC. That's the motivation for introducing the type.converters concept above, which would allow users to specify (for each column not handled natively) exactly how they want things converted, rather than relying on whatever convention the connector adopts. Otherwise, I can't think of anything better than checking whether the value "looks like" a UUID and converting it to a string, or unconditionally converting all values of Types.OTHER to strings, neither of which seems satisfactory to me.

xiongtx commented 7 years ago

This conversation is about KC sources, but it seems there's a similar problem with KC sinks.

While it's possible to insert a string into a UUID column via the psql CLI and have Postgres convert it automatically, JDBC requires a different PreparedStatement binding for a String than for a UUID.

When using a KC sink where a value has a String schema but a UUID logical type (that our own deserializer turns into a UUID), KC ignores the logical type metadata and just prepares the value as a string, which fails to insert.

One solution would be to have some KC-recognized Avro metadata that tells JDBC to use setObject(<column>, <value>, Types.OTHER). Another would be for KC (but preferably the rest of Kafka as well; see KAFKA-4353) to support UUID directly.
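
As a standalone illustration of that first option: with the Postgres JDBC driver, binding the UUID's string form under Types.OTHER lets the server coerce it into the uuid column, where a plain setString would fail. The connection details and test table (id uuid, name text) below are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Types;

public class UuidSinkBinding {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
         PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO test (id, name) VALUES (?, ?)")) {
      // Types.OTHER tells the driver to send the value as the column's
      // native type, so Postgres parses the string into a uuid.
      stmt.setObject(1, "123e4567-e89b-12d3-a456-426614174000", Types.OTHER);
      stmt.setString(2, "example");
      stmt.executeUpdate();
    }
  }
}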

qiao-meng-zefr commented 6 years ago

Just want to comment that I'm having the same issue with a KC sink.

avgojzack commented 4 years ago

Is this fixed by https://github.com/confluentinc/kafka-connect-jdbc/pull/477?