devshawn / kafka-gitops

🚀 Manage Apache Kafka topics and generate ACLs through a desired state file.
https://devshawn.github.io/kafka-gitops
Apache License 2.0

Feature request: managing schemas #50

Open RobinGoussey opened 3 years ago

RobinGoussey commented 3 years ago

Hi,

Right now there is nothing like this for schema management. It might be useful to also declare which topics/subjects use which schemas:

```yaml
config:
  # Where the schemas reside
  schemaDir: /tmp/output_dir/
  schema:
    registry:
      url: http://localhost:8081
#     username: test
#     password: test
schemas:
  - relativeLocation: Personnel.json
    # This is equivalent to Confluent references: https://docs.confluent.io/platform/current/schema-registry/develop/api.html#post--subjects-(string-%20subject)-versions
    references:
      - name: com.person.json
        subject: person
        version: -1
    # If left blank, it will auto-submit to personnel-value
    subjects:
      - personnel-raw-value
      - personnel-refined-value
      - personnel-x-value
```

This would allow the topic state and schema state to be managed by one tool/file:

```shell
java -jar kafka-schema-gitops-1.0-SNAPSHOT-jar-with-dependencies.jar -i validate # or execute
```
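The "if left blank, it will auto submit to personnel-value" rule in the proposal can be sketched in a few lines. This is a hypothetical Python illustration, not kafka-gitops code; `resolve_subjects` and the dict shape mirroring the YAML entry are invented names for the example:

```python
import os

def resolve_subjects(entry):
    """Return the subjects a schema should be registered under.

    If no explicit 'subjects' list is given, fall back to
    '<file basename>-value', mirroring the proposal's defaulting rule.
    """
    explicit = entry.get("subjects")
    if explicit:
        return list(explicit)
    base, _ = os.path.splitext(os.path.basename(entry["relativeLocation"]))
    return [f"{base.lower()}-value"]

# An explicit subjects list wins over the derived default.
entry = {
    "relativeLocation": "Personnel.json",
    "subjects": ["personnel-raw-value", "personnel-refined-value"],
}
print(resolve_subjects(entry))  # ['personnel-raw-value', 'personnel-refined-value']

# With no subjects listed, the default is derived from the file name.
print(resolve_subjects({"relativeLocation": "Personnel.json"}))  # ['personnel-value']
```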

devshawn commented 3 years ago

I really like this idea. I will definitely take this on in the future, though I'm not sure exactly when I'd have the time. I could see it happening within the next couple of months.

jrevillard commented 3 years ago

I'm very much interested in this too, and I'm ready to help.

Best, Jerome

devshawn commented 3 years ago

I love this idea and this would be an awesome feature to have. I think it fits in great with the other features of kafka-gitops. I've got some availability coming up and would be able to help out as well.

@jrevillard I'd be happy to have your help as well! With a feature this big, I'd like to do a bit of planning & outlining before we get started on the code. I'd like to make a few examples of how to structure the YAML and discuss.

jrevillard commented 3 years ago

Hello,

I just saw that you have a first implementation, @Twb3! https://github.com/Twb3/kafka-gitops/commit/50fa5ccb54ec77ee618d4152436a66804783fa9c

What's the status? Do you need help?

Best, Jerome

HSA72 commented 3 years ago

Looking forward to having this feature @Twb3!

tball-dev commented 3 years ago

Hey guys, sorry I did not see this earlier. I did a quick POC for myself to see what's possible, and I think I've got it mostly nailed down. I hope to propose a file structure soon; I just need to write it up.

tball-dev commented 3 years ago

Schema Registry POC

https://github.com/Twb3/kafka-gitops/commit/50fa5ccb54ec77ee618d4152436a66804783fa9c

Proposed Schema State File Structure

```yaml
schemas:
  order-value:
    type: Avro
    file: order-schema.avsc
  order-2-value:
    type: Avro
    file: order-schema.avsc
  shipment-value:
    type: Avro
    file: shipment-schema.avsc
    references:
      - name: order-value
        subject: order-value
        version: 1
```

Each schema entry above is the name of the subject to be registered in the Schema Registry. I did this to keep it consistent with how topic and service entries are the names of the resources to be created.

`type` is self-explanatory, although this POC is restricted to Avro only, because I don't know how to parse Protobuf yet. (I couldn't find good examples in the Confluent Schema Registry client either.)

`file` is a reference to the schema file located under `SCHEMA_DIRECTORY`.

Config

Config is handled via environment variables:

- `SCHEMA_REGISTRY_SASL_JAAS_USERNAME`
- `SCHEMA_REGISTRY_SASL_JAAS_PASSWORD`
- `SCHEMA_REGISTRY_URL` (default: `http://localhost:8081`)
- `SCHEMA_DIRECTORY` (absolute path to the directory of schema files; default: `System.getProperty("user.dir")`)
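The environment-variable handling described above, with its two defaults, might look like the following. This is an illustrative Python sketch, not the POC's Java code; `load_config` is an invented name, while the variable names and defaults come from the comment:

```python
import os

def load_config(env=os.environ):
    """Read Schema Registry settings from environment variables,
    applying the defaults named in the POC description."""
    return {
        "registry_url": env.get("SCHEMA_REGISTRY_URL", "http://localhost:8081"),
        # Python's os.getcwd() stands in for Java's System.getProperty("user.dir")
        "schema_dir": env.get("SCHEMA_DIRECTORY", os.getcwd()),
        "username": env.get("SCHEMA_REGISTRY_SASL_JAAS_USERNAME"),
        "password": env.get("SCHEMA_REGISTRY_SASL_JAAS_PASSWORD"),
    }

cfg = load_config({})  # no variables set, so both defaults apply
print(cfg["registry_url"])  # http://localhost:8081
```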

The login module is currently hardcoded to `org.apache.kafka.common.security.plain.PlainLoginModule`.

Things to discuss

Schema differences

To know whether a schema needs to be updated, I parse the schema file and generate a diff using zjsonpatch. I chose zjsonpatch because it reports differences per JSON node rather than for the entire file. For example, the content of your schema could be identical while the order of its nodes has been rearranged. The more I think about this as I type it, the less necessary it seems. Ultimately, this part still needs work.
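The key-order concern can be illustrated without a patch library at all. In this Python sketch (not the POC's zjsonpatch-based Java), `json.loads` builds dicts, and dict equality in Python ignores key order, so rearranged object keys compare equal while arrays (e.g. Avro record `fields`) stay order-sensitive:

```python
import json

def same_schema(a: str, b: str) -> bool:
    """Compare two JSON schema documents ignoring object key order.

    Dict equality disregards key order, so rearranging the keys of an
    object is not reported as a difference; array order (such as the
    order of Avro record fields) is still significant.
    """
    return json.loads(a) == json.loads(b)

local = '{"type": "record", "name": "Order", "fields": []}'
remote = '{"name": "Order", "fields": [], "type": "record"}'
print(same_schema(local, remote))  # True: only key order differs
```

A node-level diff like zjsonpatch additionally tells you *which* nodes changed, which is useful for plan output, but for a plain "changed or not" decision a canonical comparison like this may be enough.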

Deletion

Schema Registry allows us to either soft-delete or permanently delete. I think we would want to always permanently delete, since we want our state file to represent exactly what is deployed. This is how my code currently works. I believe this also deletes all versions.
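In the Schema Registry REST API, a permanent delete is a two-step operation: a subject must be soft-deleted first, and only then can it be deleted with `?permanent=true`. A small sketch (hypothetical helper name; it only builds the request pairs rather than issuing them, so the HTTP layer is out of scope):

```python
from urllib.parse import quote

def delete_requests(base_url: str, subject: str):
    """Return the ordered (method, url) pairs needed to permanently
    delete a subject: soft delete first, then the permanent delete,
    which removes all versions for good."""
    path = f"{base_url}/subjects/{quote(subject, safe='')}"
    return [
        ("DELETE", path),                        # soft delete
        ("DELETE", path + "?permanent=true"),    # hard delete, all versions
    ]

for method, url in delete_requests("http://localhost:8081", "order-value"):
    print(method, url)
```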

Validation

For validating schemas I did more than just check that the YAML is valid. I check that the schema file exists under `SCHEMA_DIRECTORY`, and I use methods from Confluent's Schema Registry client to validate that the Avro schemas can be parsed. As a result, when you validate a schema with references, it will actually make a call to your Schema Registry to verify that the schema you want to reference exists. This may go too far for validation; I just need some input.
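The local half of that validation (file exists, file parses) can be sketched as below. This Python stand-in only checks JSON parseability; the actual POC goes further and parses the file as an Avro schema with Confluent's client, and the remote reference check is omitted here entirely:

```python
import json
import os
import tempfile

def validate_schema_entry(schema_dir, file_name):
    """Minimal local validation: the file must exist under the schema
    directory and must at least parse as JSON. Returns the full path."""
    path = os.path.join(schema_dir, file_name)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"schema file not found: {path}")
    with open(path) as f:
        json.load(f)  # raises ValueError on malformed JSON
    return path

# Demo with a throwaway directory standing in for SCHEMA_DIRECTORY.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "order-schema.avsc"), "w") as f:
        f.write('{"type": "record", "name": "Order", "fields": []}')
    print(validate_schema_entry(d, "order-schema.avsc"))
```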

HSA72 commented 3 years ago

Nice. Avro is a very good start. Just make sure that this will also work against Schema Registry in Confluent Cloud. When can we start testing? :)

jrevillard commented 3 years ago

Dear @Twb3,

This seems really promising, thanks!

Yes, Avro is a good start, and the final goal would be to support Thrift, Protocol Buffers, and JSON Schema. You say that you use Confluent's Schema Registry client to validate the Avro schemas, so I think this library would be capable of validating the other types, wouldn't it?

Concerning config, I could contribute Kerberos support, as I will need it :-)

Best, Jerome

HSA72 commented 3 years ago

@Twb3 How is this feature going? Is it stable enough to start using? I am very eager to have this as soon as possible.

jrevillard commented 3 years ago

Hi @Twb3 @HSA72 @devshawn ,

I don't know if you were aware of this: https://github.com/domnikl/schema-registry-gitops

jrevillard commented 3 years ago

There is one thing which is complicated for me to answer: how do we deal with schema IDs and versions? Those IDs are generated server-side and are used by the Kafka clients to identify the right schema. This means there is no way to ensure that a schema will get a given ID/version, and therefore kafka-gitops cannot be the source of truth for them, can it?
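This is exactly how the registry behaves: registering a schema (`POST /subjects/<subject>/versions`) returns the server-assigned ID in the response body, so a GitOps tool can only read IDs back, never dictate them. A tiny sketch of reading such a response (hypothetical helper name; the `{"id": ...}` shape is the registry's documented response):

```python
import json

def parse_registered_id(response_body: str) -> int:
    """Extract the server-assigned schema ID from the JSON body that
    Schema Registry returns after a schema registration call."""
    return json.loads(response_body)["id"]

print(parse_registered_id('{"id": 42}'))  # 42
```

Consequently, the state file can be the source of truth for *which* schemas exist under which subjects, while IDs and version numbers remain registry-assigned outputs.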

jrevillard commented 3 years ago

As promised, you can find more than a POC implementation in #76!

Please comment, improve, etc.

tball-dev commented 3 years ago

@HSA72 I apologize for not following up on this sooner. I have not had the opportunity recently to dedicate time to this feature.

@jrevillard Thanks for posting that link to the schema registry gitops implementation! Looks promising.

tball-dev commented 3 years ago

> As promised, you can find more than a POC implementation in #76 !
>
> Please comment, improve etc...

Nice, I will take a look!

devshawn commented 3 years ago

@jrevillard I wasn't aware of that project; pretty nice. I still like our approach of putting it into this tool. Maybe its owner would like to help contribute as well?

I'll let you and @Twb3 take the lead on this and then give suggestions and take a look at the POC shortly.