datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License

Support Confluent Schema Registry server #8

Closed databius closed 1 year ago

databius commented 1 year ago

Add a new Confluent Schema Registry server type. Example:

servers:
  my-stage:
    type: confluent_schema_registry
    host:
    subjects:
      compatibility:
      name:
      type:
      schema:

jochenchrist commented 1 year ago

Thank you for creating this issue. Could you please provide an example?

databius commented 1 year ago

Sure, let me share the context first.

I'm learning about Data Contracts, but the concept is quite new and lacks a universal tool for implementing it in production. Some authors suggest handling Data Contracts with familiar tools. In my case, we define schemas in Avro and manage them with the Schema Registry. We also use dbt to check data quality.

IMO the data contract should be the single source of truth, so contracts should be stored in a centralized repository, and I am trying to find a common format for them.

Last week I tried the Open Data Contract Standard. It's a great template, but I believe the Data Contract Specification is the best data contract format so far.

Back to the issue. The Schema Registry does not store data; it only stores metadata, which contains the most important part of the contract: the schema. One thing the Schema Registry does well is schema evolution, and I think we can reuse it. In practice, I still have to register the schemas (from the data contract) in the Schema Registry, because a lot of services depend on it.

In the example above, I thought we should define the schema registry in the servers section. But I just realized that a schema can be registered for multiple topics, and even on multiple Schema Registry servers. If we repeated the schema (which might be thousands of lines) for each server, the contract would become very long.

So I propose a new example:

schema:
  type: avro
  specification: |-
    {
      "type": "record",
      "name": "SomethingShared",
      "namespace": "com.databius.shared",
      "fields": [
        {
          "name": "greeting",
          "type": "string"
        }
      ]
    }
  registry:
    - name: dev
      type: confluent_schema_registry
      host: http://localhost:8081
      subjects:
        name: com.databius.shared.SomethingShared1-value
        compatibility: FORWARD_TRANSITIVE
    - name: prod
      type: confluent_schema_registry
      host: http://localhost:8081
      subjects:
        name: com.databius.shared.SomethingShared2-value
        compatibility: FORWARD_TRANSITIVE

Note: I only have experience with Confluent Schema Registry.

databius commented 1 year ago

The example is inspired by schema-registry-gitops.

I tried implementing a simple script that extracts the schema from the contract and registers it with the Schema Registry using schema-registry-gitops. The results are quite promising.
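
For illustration only, here is a minimal Python sketch of what such an extraction step could look like. It is not the script mentioned above and it does not use schema-registry-gitops. Instead, it plans the calls against the Confluent Schema Registry REST API (`PUT /config/{subject}` and `POST /subjects/{subject}/versions`) without sending anything. The `CONTRACT` dict is a hypothetical in-memory form of the proposed example, as a YAML parser would return it, and `plan_requests` is a made-up helper name.

```python
import json
from urllib.parse import quote

# Hypothetical parsed form of the contract fragment proposed in this thread.
# The `schema.registry` layout is a proposal, not part of the specification.
AVRO_SCHEMA = json.dumps({
    "type": "record",
    "name": "SomethingShared",
    "namespace": "com.databius.shared",
    "fields": [{"name": "greeting", "type": "string"}],
})

CONTRACT = {
    "schema": {
        "type": "avro",
        "specification": AVRO_SCHEMA,
        "registry": [
            {
                "name": "dev",
                "type": "confluent_schema_registry",
                "host": "http://localhost:8081",
                "subjects": {
                    "name": "com.databius.shared.SomethingShared1-value",
                    "compatibility": "FORWARD_TRANSITIVE",
                },
            },
            {
                "name": "prod",
                "type": "confluent_schema_registry",
                "host": "http://localhost:8081",
                "subjects": {
                    "name": "com.databius.shared.SomethingShared2-value",
                    "compatibility": "FORWARD_TRANSITIVE",
                },
            },
        ],
    }
}


def plan_requests(contract):
    """Return (method, url, body) tuples; nothing is sent over the network."""
    schema = contract["schema"]
    # Fail early if the embedded Avro definition is not valid JSON.
    json.loads(schema["specification"])

    planned = []
    for target in schema.get("registry", []):
        if target["type"] != "confluent_schema_registry":
            continue  # this sketch only understands the Confluent registry type
        base = target["host"].rstrip("/")
        subject = quote(target["subjects"]["name"], safe="")
        compatibility = target["subjects"].get("compatibility")
        if compatibility:
            # Set the subject-level compatibility before registering a version.
            planned.append(
                ("PUT", f"{base}/config/{subject}", {"compatibility": compatibility})
            )
        # The schema is written once in the contract and reused for every target.
        planned.append((
            "POST",
            f"{base}/subjects/{subject}/versions",
            {"schema": schema["specification"], "schemaType": schema["type"].upper()},
        ))
    return planned


if __name__ == "__main__":
    for method, url, _body in plan_requests(CONTRACT):
        print(method, url)
```

Because the schema is stored once and only the registry targets are repeated, the contract stays short even when the same schema goes to several subjects or servers. A real script would still need to send these requests, with content type `application/vnd.schemaregistry.v1+json`, and handle authentication and errors.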

jochenchrist commented 1 year ago

Thanks for the clarification. I understand the flow and the use case, which sounds nice.

For simplicity, I would vote to keep the registry information as a custom field rather than include it in the general data contract specification, as the use case is quite limited to Confluent/Kafka...

If there are more votes for adding registries, we can discuss reopening the issue.