actgardner / gogen-avro

Generate Go code to serialize and deserialize Avro schemas
MIT License
365 stars 85 forks

question: can i read arbitrary avro messages? #134

Open mikedewar opened 4 years ago

mikedewar commented 4 years ago

I'd like to read arbitrary avro messages from Kafka, where the topic is specified at build time, and the avro schema is pulled from our schema registry.

I thought I'd almost got it by using compiler.CompileSchemaBytes(schema, schema), where schema is retrieved from the registry, and then vm.Eval to deserialise the message.

However I couldn't figure out a way of persuading vm.Eval to accept something vague for its target argument. From what I can tell I have to specify a target using New<RecordType>() which is tricky as I don't know <RecordType> ahead of time.

To explain more, I could change the generator (let's pretend I could do this) to create NewRecord() without using the <RecordType> in the name of the function and then I could call

err = vm.Eval(&buf, deserialiser, NewRecord())

and I think I'd have what I need. However this feels wrong and I imagine isn't something that you guys would want in your codebase.

What am I missing? Is there a better way of achieving the above? Or should I just suffer the indignity of using map[string]interface{} even though I've a perfectly good schema sitting in the registry?

dangogh commented 4 years ago

I need the same thing, and it looks like #86 outlines a way to enhance gogen-avro to do it. I'm contemplating taking that on, but can't ATM.

actgardner commented 4 years ago

I'd like to read arbitrary avro messages from Kafka, where the topic is specified at build time, and the avro schema is pulled from our schema registry.

This is a little ambiguous; I can see two different scenarios:

1. Your records are completely arbitrary and you're simply iterating over all the fields with no application logic, for example to convert records from Avro into a different format.
2. Your records are tied to your application logic: there's a fixed set of records your app can handle, and the app needs to be aware of the fields in the record to perform some set of operations.

For scenario 1, gogen-avro is never going to be appropriate. Something like goavro (https://github.com/linkedin/goavro) will give you the map[string]interface{} API you described, with no compile-time type-safety guarantees. goavro is a great project and it's very flexible.

In scenario 2, gogen-avro may be appropriate. gogen-avro is designed to work in two passes:

If your application logic needs to be updated anyway to handle new records/fields, then your development workflow would involve updating the application and generating new structs for the updated schemas at the same time. This would give you type safety when you're building out new features based on new records/fields.

Issue #86 specifically is about interacting with the Confluent schema registry as a source of the schema to pass to vm.Eval. It doesn't involve dynamically generating Go code at run-time or bypassing the compile-time step.

I know this is a little abstract; if you can provide more detail or an example of the desired API (something in between generated structs and map[string]interface{}), I might be able to help further.

dangogh commented 4 years ago

I'm not the originator of this issue, but I'd like to provide my perspective. I'm not particularly interested in mapping to any arbitrary schema. I'm interested in evolving a schema. I expect the records on a particular topic to be one of the known versions of that schema and to be able to take the record, determine which version of the schema it maps to and deserialize it. If it's not a recognized version of that schema, then the reader should not allow it to continue.

I think this is doable with the current structure of gogen-avro.

I hope that's helpful.

mikedewar commented 4 years ago

@actgardner thank you for your response! I really appreciate the attention =)

My scenario is definitely scenario 1, except that I'm happy to generate an application tied to a particular schema, i.e. the service itself doesn't need to read arbitrary Avro at run time, and I would love to avail myself of the type-safety guarantees that having a well-defined struct brings.

I can also use a well-defined struct downstream (in our case one example is writing to Parquet files) in a streamlined and rapid way that I can't with arbitrary map[string]interface{} values.

Right now we're exploring generating a .go file outside of the compile step using gogen-avro and storing it with the avro schema in its own little git repo. We then manually modify it to follow a convention e.g. NewRecord rather than New<myType>. Then, at (well, before really) compile time we pull in the .go file and the schema as a subrepo and build out the service for a particular topic. It's a very clumsy process at best!

But the upsides are:

Does this make sense?

actgardner commented 4 years ago

@mikedewar

Right now we're exploring generating a .go file outside of the compile step using gogen-avro and storing it with the avro schema in its own little git repo.

This sounds like the workflow I imagined - keeping the schemas and the generated code in a git repo you use as a submodule. This is probably the simplest way to make sure there's a generated struct for each schema (you can use pre-push hooks, etc. to enforce it as well) that can be shared across projects.

We then manually modify it to follow a convention e.g. NewRecord rather than New<myType>.

There's a lot of hidden "contracts" within the generated code where renaming or moving things might break it in unexpected ways. Hopefully we can find a solution that doesn't involve manually modifying the generated code.

Then, at (well, before really) compile time we pull in the .go file and the schema as a subrepo and build out the service for a particular topic. It's a very clumsy process at best!

So if I understand right the issue is the amount of manual tweaking involved in adapting the archiver service to each new schema?

One thing I've been considering is adding code to generate what I would call a "de-multiplexer" for a set of schemas - it would take some framed data (either Single-Object Encoding or Confluent Schema Registry in the future) and match it to the appropriate schema. The result would be an interface{} or a small interface we define - the resulting records could either be downcast to their generated types, or you could serialize them to JSON or some other format (parquet?).

An example of this in action for Single-Object Encoding:

func (demux *SOEDemux) Deserialize(reader io.Reader) (interface{}, error) {
	// Read the header to get the fingerprint (or schema ID for Confluent)
	fingerprint, err := soe.ReadHeader(reader)
	if err != nil {
		return nil, err
	}
	switch fingerprint {
	case "\x10\x20":
		return DeserializeRecordOne(reader)
	case "\x20\x30":
		return DeserializeRecordTwo(reader)
		// Add more cases for every schema we know about
	}
	return nil, fmt.Errorf("unknown schema type")
}

Would this be suitable? You could then hand the resulting interface{} to your Parquet writer. The demultiplexing code is simple and I could add that this week. I haven't looked at the Confluent schema registry in any depth right now and I'm starting a new job soon, so that piece may take longer. PRs are always welcome.
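To make the "downcast to their generated types" step concrete, a caller of such a demux might type-switch on the returned interface{}. A stdlib-only sketch, where RecordOne and RecordTwo are hypothetical stand-ins for generated structs:

```go
package main

import "fmt"

// Hypothetical stand-ins for gogen-avro generated record structs.
type RecordOne struct{ ID int }
type RecordTwo struct{ Name string }

// handle dispatches on the concrete type behind the interface{}
// that a demux like the Deserialize sketch above would return.
func handle(rec interface{}) string {
	switch r := rec.(type) {
	case *RecordOne:
		return fmt.Sprintf("one: %d", r.ID)
	case *RecordTwo:
		return fmt.Sprintf("two: %s", r.Name)
	default:
		return "unhandled record type"
	}
}

func main() {
	fmt.Println(handle(&RecordOne{ID: 7}))
	fmt.Println(handle(&RecordTwo{Name: "x"}))
}
```

Records that fall through to the default case could be logged or rejected, mirroring the "unknown schema" branch of the demux itself.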

mikedewar commented 4 years ago

So I think this makes sense: a general purpose Deserialise function that we could rely on existing no matter what, that would switch on the fingerprint of the message. If I've understood it properly it would mean we could build one archiver (or whatever) service and it could consume any known topic? That'd be nice!

I'm less bothered about the Confluent registry than I was 9 days ago, as a colleague of mine has very patiently pointed out we can't rely on it existing, being reachable, or being populated at build time. So I think it's appropriate that we get the schema before compilation one way or another, and not worry too much about where it comes from, at least from the point of view of consumers.

Would be happy to help if I can!

actgardner commented 3 years ago

Just a heads up, I've added support for generic data types in the most recent major release. This might serve your needs, let me know: https://github.com/actgardner/gogen-avro/blob/master/README.md#generic-data

plamen-kutinchev commented 9 months ago

Just a heads up, I've added support for generic data types in the most recent major release. This might serve your needs, let me know: https://github.com/actgardner/gogen-avro/blob/master/README.md#generic-data

Is the support for generic data still in beta, or can we rely on it? cc @actgardner