MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com

storage/sinks: support `KEY FORMAT` / `VALUE FORMAT` syntax #26787

Open morsapaes opened 2 months ago

morsapaes commented 2 months ago

Feature request

As is, we don't support specifying different formats for the key and the value of records emitted by Kafka sinks. This is inconsistent with the semantics of Kafka sources (#20135), and it prevents users from opting out of using complex types for the key (which has known issues of its own). We should introduce the KEY FORMAT / VALUE FORMAT options for sinks as well, so that sinks can emit text and bytea keys in Kafka records.

Original ask (Slack)


Note: might be a good one to bundle up with #23925.

rjobanp commented 1 week ago

Currently, when using FORMAT avro, we publish the generated key and value schemas to the schema registry (if a key is provided).

What should happen in the case of KEY FORMAT json VALUE FORMAT avro? Should we publish the key JSON schema to the registry too? Or should we just publish the value Avro schema and leave the key schema unset in the registry?

Similarly, what about KEY FORMAT avro VALUE FORMAT json? And KEY FORMAT text VALUE FORMAT avro? It looks like our schema registry crate allows specifying either avro, proto, or json schemas, but not text/bytes.

benesch commented 1 week ago

> What should happen in the case of KEY FORMAT json VALUE FORMAT avro? Should we publish the key JSON schema to the registry too? Or should we just publish the value Avro schema and leave the key schema unset in the registry?

The {KEY|VALUE} FORMAT is designed to tell you whether or not to use the schema registry. At least it works like this for sources. You may need to bang on it to get to parity for sinks.

With Avro, you have to provide a USING option to specify schema behavior. But, for sources, you're not limited to just the CSR! You can also provide the schema inline:

{KEY|VALUE} FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY
{KEY|VALUE} FORMAT AVRO USING SCHEMA '<inline schema>'

So, turning back to JSON, the vision here is this:

# The only thing we support today.
{KEY|VALUE} FORMAT JSON

# Something we'll add support for eventually.
{KEY|VALUE} FORMAT JSON USING CONFLUENT SCHEMA REGISTRY ...

Adding support for JSON + CSR is tracked in https://github.com/MaterializeInc/materialize/issues/7186. Recommend you don't go down that road now! Mapping Materialize relations to JSON schema is nontrivial.
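
To make the rule above concrete (the format clause, not the format name alone, decides whether the schema registry is involved), here is a minimal Python sketch. It is an illustration under stated assumptions, not Materialize's implementation: `FormatSpec` and `subjects_to_publish` are hypothetical names, and the `<topic>-key` / `<topic>-value` subject names assume the Confluent default subject naming convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FormatSpec:
    """Hypothetical model of one side of a {KEY|VALUE} FORMAT clause."""
    name: str                # "avro", "json", "text", or "bytes"
    using_csr: bool = False  # True for USING CONFLUENT SCHEMA REGISTRY


def subjects_to_publish(
    topic: str, key: Optional[FormatSpec], value: FormatSpec
) -> List[str]:
    """Return the registry subjects a sink would publish schemas under.

    Assumption: a schema is published only for a side whose clause asks
    for the registry, and only Avro may ask for it today.
    """
    subjects = []
    for suffix, spec in (("key", key), ("value", value)):
        if spec is None or not spec.using_csr:
            # No key, or a format that does not involve the registry
            # (plain JSON, text, bytes): nothing to publish for this side.
            continue
        if spec.name != "avro":
            # JSON + CSR is the "eventually" case and is not supported yet.
            raise ValueError(f"{spec.name} with a schema registry is unsupported")
        subjects.append(f"{topic}-{suffix}")
    return subjects


avro_csr = FormatSpec("avro", using_csr=True)
print(subjects_to_publish("orders", FormatSpec("json"), avro_csr))
print(subjects_to_publish("orders", avro_csr, FormatSpec("json")))
print(subjects_to_publish("orders", FormatSpec("text"), avro_csr))
```

Under these assumptions, KEY FORMAT json VALUE FORMAT avro and KEY FORMAT text VALUE FORMAT avro both publish only the value schema, KEY FORMAT avro VALUE FORMAT json publishes only the key schema, and nothing is ever published for a JSON, text, or bytes side.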