confluentinc / schema-registry

Confluent Schema Registry for Kafka
https://docs.confluent.io/current/schema-registry/docs/index.html

Document how to serialize Scala classes to Avro without reflection #1826

Closed: Oduig closed this issue 3 years ago

Oduig commented 3 years ago

I would like to send a generic case class to Kafka with KafkaProducer, using Avro serialization and the Confluent Schema Registry. The documented approach assumes that one wants to use reflection and avro4s to generate an Avro schema, which is then submitted to the registry.

In our situation, we have a registry which already contains the appropriate schema. Although the fields are the same, this schema is not identical in its metadata (name, namespace, etc.). How can we serialize our model objects using schemas from the Schema Registry, and send them over Kafka?

Code samples here: https://stackoverflow.com/questions/67004284/sending-pojo-to-kafka-with-pre-defined-avro-schema-in-schema-registry
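
For reference, a minimal sketch of the kind of producer code this is about; the `Person` case class, topic name, and connection settings are illustrative assumptions, not taken from the linked post:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical domain model; the matching schema already lives in the registry.
case class Person(name: String, age: Int)

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://localhost:8081")

  val producer = new KafkaProducer[String, Person](props)
  // This is the desired call; as the discussion below explains, it fails at
  // runtime because a plain case class is not an IndexedRecord.
  producer.send(new ProducerRecord("people", "key", Person("Ada", 36)))
  producer.close()
}
```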

Since this seems like a straightforward approach to using Schema Registry, I am posting it here as an issue. Documenting this would likely help many developers to get up and running with SR.

OneCricketeer commented 3 years ago

I'd commented on your Stack Overflow post; would the solution of not creating case classes work for you, e.g. downloading the schema and generating classes instead?

More specifically, since this repo has no examples outside of Java code, why limit examples to that? What about other JVM languages?
What's preventing your Scala project from using generated Java classes (can be done using Maven if not using the sbt plugin above)?

I feel like the issue that should be addressed is the Avro4s model definition not matching the schema you've stored (can you give an example, and have you opened an issue with the Avro4s project about this?)
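
For illustration, a hedged sketch of the download-and-generate route described above; it assumes a `Person` class has already been generated by the Avro compiler (e.g. avro-maven-plugin, or `java -jar avro-tools.jar compile schema`) from a `person.avsc` downloaded out of the registry:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
// Assumed: a Java class generated by the Avro compiler; it extends
// SpecificRecordBase, which is an IndexedRecord.
import com.example.avro.Person

object GeneratedClassProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://localhost:8081")
  // The schema of record already lives in the registry; don't re-register it.
  props.put("auto.register.schemas", "false")

  val producer = new KafkaProducer[String, Person](props)
  val person = Person.newBuilder().setName("Ada").setAge(36).build()
  producer.send(new ProducerRecord("people", "key", person))
  producer.close()
}
```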

Oduig commented 3 years ago

> I'd commented on your Stack Overflow post; would the solution of not creating case classes work for you, e.g. downloading the schema and generating classes instead?
>
> More specifically, since this repo has no examples outside of Java code, why limit examples to that? What about other JVM languages? What's preventing your Scala project from using generated Java classes (can be done using Maven if not using the sbt plugin above)?
>
> I feel like the issue that should be addressed is the Avro4s model definition not matching the schema you've stored (can you give an example, and have you opened an issue with the Avro4s project about this?)

Thank you for your time, appreciate it! You are correct that generating these classes would work, though I feel it is a complicated solution for a relatively simple problem. There are three small reasons why I think having a serializer for regular classes is preferable.

  1. Like many projects, we follow a domain-model approach: we receive data externally, map it to our model, and then push this data to Kafka. We don't want to use generated classes as our domain models, because they are coupled to Avro logic, and mapping our domain models to (almost identical) Avro classes is an extra, seemingly unnecessary step (see the sketch after this list). Why not serialize directly and get an error if the schema doesn't match the code?
  2. Our schema registry requires basic authentication, but we do not want to check our passwords into git. This means we would have to find a way to enable/disable class generation in build.sbt depending on an environment variable, and every developer who wants to generate classes would have to set an extra environment variable for the build.
  3. Everyone understands a case class/POJO, whereas generated code is not always understood and can be mishandled: someone may refactor or modify it without realizing it's generated, someone may get stuck because they're not sure how to re-generate the class, and so on.
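
As a concrete illustration of point 1, a minimal sketch of the mapping step in question; `Person` is assumed to be the Avro class generated from the registered schema:

```scala
// Hand-written domain model, free of Avro dependencies.
case class DomainPerson(name: String, age: Int)

// The (almost identical) generated Avro class forces a translation layer:
def toAvro(p: DomainPerson): Person =
  Person.newBuilder()
    .setName(p.name)
    .setAge(p.age)
    .build()
```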

To summarize, we can use sbt code generation, but it feels like a workaround compared to the following straightforward flow:

  1. Define a schema in Schema Registry (API-first mindset)
  2. Set up a Kafka client with Schema Registry
  3. Construct a case class, normal Scala class or even just a POJO
  4. Send the object to a topic (automatically using the schema to serialize to Avro)

I think this would make things a lot simpler and easier to implement on both sides, right? Is there a reason why we would not do it this way?
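
For comparison, a hedged sketch of how close one can get to steps 1-4 today without generated classes, by copying the case class fields into Avro's GenericRecord (which is an IndexedRecord); the subject name `people-value`, field names, and connection settings are assumptions:

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient

// Hypothetical domain model; fields must line up with the registered schema.
case class Person(name: String, age: Int)

object GenericRecordSketch extends App {
  // Steps 1-2: fetch the pre-registered schema instead of generating classes.
  val registry = new CachedSchemaRegistryClient("http://localhost:8081", 100)
  val metadata = registry.getLatestSchemaMetadata("people-value") // assumed subject
  val schema   = new Schema.Parser().parse(metadata.getSchema)

  // Steps 3-4: copy the case class fields into a GenericRecord, which *is*
  // an IndexedRecord and is therefore accepted by KafkaAvroSerializer.
  val person = Person("Ada", 36)
  val record: GenericRecord = new GenericData.Record(schema)
  record.put("name", person.name)
  record.put("age", person.age)

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://localhost:8081")
  props.put("auto.register.schemas", "false") // reuse the registry's schema

  val producer = new KafkaProducer[String, GenericRecord](props)
  producer.send(new ProducerRecord("people", "key", record))
  producer.close()
}
```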

OneCricketeer commented 3 years ago

You seem to be missing the fact that only IndexedRecord subclasses can be serialized to Avro without reflection, not standalone case classes / POJOs.

Secondly, if you define the schema first anyway, your manual class definition can diverge from it, so generation should be preferred over depending on runtime errors that surface only after you've built the project, published its artifacts, and deployed them. In some environments, that feedback loop can be several days long. You can store environment variables in CI/CD systems and refer to them in sbt during the build; the generated classes then get published as versioned code dependencies, an extension of the versioned schemas in the registry. In Maven, we use profiles to opt in to these features, not only environment variables.
Generated classes shouldn't be output into src/main/scala, for clear separation, and they include Javadoc stating that they are generated and not meant to be modified. Any file in the generated folder should be ignored by VCS, so that in CI/CD the classes are generated before compilation and therefore cannot have been modified anyway.
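
A hedged build.sbt sketch of the env-var gating and out-of-tree generation just described; the variable names and the `downloadAndGenerateAvro` task are hypothetical, not a documented plugin API:

```scala
// build.sbt
lazy val downloadAndGenerateAvro = taskKey[Seq[File]](
  "Hypothetical task: fetch schemas from the registry and run Avro codegen")

downloadAndGenerateAvro := {
  // A real implementation would authenticate with sys.env("REGISTRY_USER") /
  // sys.env("REGISTRY_PASS"), download the .avsc files, and generate classes
  // under (Compile / sourceManaged).value, outside src/main/scala. Elided here.
  Seq.empty[File]
}

// Wire the generator in only when credentials are present (e.g. injected as
// CI/CD secrets); local builds without them simply skip code generation.
Compile / sourceGenerators ++= {
  if (sys.env.contains("REGISTRY_USER")) Seq(downloadAndGenerateAvro.taskValue)
  else Seq.empty
}
```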

Finally, without ReflectData, no schema is automatic; but as you've stated, you'd like to follow a schema-first design.

Oduig commented 3 years ago

That sounds like a good setup, and clearly the recommended one at this time, so I'll close this. Still, the mechanism you describe is a lot more complicated to set up. For future reference, perhaps the setup proposed above is worth taking into consideration.

Serialization to JSON or XML is possible out of the box with a POJO or case class, and Avro adds a clearly defined schema on top of this. I don't yet see why a dedicated IndexedRecord is technically required.

Secondly, divergence of a manual class is not a problem if we pin the schema version in a config file and check during serialization that the fields match.

OneCricketeer commented 3 years ago

> Serialization to JSON or XML is possible out of the box with a POJO or case class

As mentioned, that would require using reflection. Notice that the Widget class is a simple POJO - https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/test/java/io/confluent/kafka/serializers/KafkaAvroSerializerTest.java#L729

> I don't yet see why a dedicated IndexedRecord is technically required

An exception is thrown if reflection is not used and the class is not a subclass of IndexedRecord; generated classes satisfy this requirement:

https://github.com/confluentinc/schema-registry/blob/master/client/src/main/java/io/confluent/kafka/schemaregistry/avro/AvroSchemaUtils.java#L143
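
To see that failure mode concretely, a hedged Scala sketch; the exact exception type and message may vary by version, and MockSchemaRegistryClient stands in for a real registry:

```scala
import scala.jdk.CollectionConverters._
import io.confluent.kafka.schemaregistry.client.MockSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroSerializer

// A plain case class: not an IndexedRecord, so not serializable without reflection.
case class Person(name: String, age: Int)

object NotAnIndexedRecord extends App {
  // Inject a mock registry client; the URL is required config but unused here.
  val serializer = new KafkaAvroSerializer(new MockSchemaRegistryClient())
  serializer.configure(Map("schema.registry.url" -> "http://unused:8081").asJava, false)

  // Throws at runtime (an IllegalArgumentException about unsupported Avro types
  // in the versions this thread discusses), via the AvroSchemaUtils check above.
  serializer.serialize("people", Person("Ada", 36))
}
```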