avro-kotlin / avro4k

Avro format support for Kotlin
Apache License 2.0

Bypass generic data step and serialize directly to binary #160

Open Chuckame opened 10 months ago

Chuckame commented 10 months ago

Currently, the library mostly acts as an adapter in front of the official Apache Avro library, going through GenericRecord and the other GenericData types.

On the codebase simplicity side, this looks easy but it really isn't: the codebase does a lot of adaptation to keep GenericData happy with the generic structures we generate.

On the performance side, we check, convert, and adapt data to fit GenericData, and then the Apache library checks and adapts that data again before finally serializing it to binary. Also, everything happens at runtime, while kotlinx serialization does most of its preparation at build time (except for contextual serializers).

On the user side, anyone who just wants the Avro format (and does not use the generated GenericRecords directly) has to call the Apache Avro library to serialize to binary or JSON, or use a helper library that does it.

On compatibility and spec conformance, we cover the whole Avro spec without really knowing how it works. We just pass the record from Avro.toRecord() to the Apache Avro lib, and 🎉 it works. Our tests currently do not check Avro format compliance; they only compare the generated GenericRecords with other generic records.

And the last axis: Kotlin Multiplatform. Since a big part of the codebase and its mechanisms is tightly coupled to the Apache Avro Java lib, the multiplatform dream seems hard to reach.

Now, how to improve that? Let kotlinx serialization do what it was created for: easily encoding any Kotlin object. Have a look at the Avro Encoder methods: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/io/Encoder.java

All the methods map exactly onto the kotlinx encoders, and the way they are called fits the kotlinx workflow perfectly:

The other advantage is that there is no primitive autoboxing at all, and no more callback hell, thanks to using the Avro Encoder directly as the output.
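To make the mapping concrete, here is a minimal sketch of a kotlinx encoder that forwards each primitive call straight to an Apache Avro Encoder, with no GenericRecord in between. `DirectAvroEncoder`, `User`, and `encodeUser` are hypothetical names for illustration and not part of avro4k. The sketch assumes the kotlinx serialization compiler plugin and the Apache Avro Java library are on the classpath, and it only handles flat classes with non-null primitive and String fields (no unions, arrays, maps, or nested named types).

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.SerializersModule
import org.apache.avro.io.Encoder
import org.apache.avro.io.EncoderFactory
import java.io.ByteArrayOutputStream

// Hypothetical class, not part of avro4k: each kotlinx primitive call is
// forwarded to the Avro Encoder method that takes the same primitive type.
@OptIn(ExperimentalSerializationApi::class)
class DirectAvroEncoder(private val out: Encoder) : AbstractEncoder() {
    // Empty module: this sketch does not resolve contextual or polymorphic serializers.
    override val serializersModule: SerializersModule = SerializersModule { }

    override fun encodeBoolean(value: Boolean) = out.writeBoolean(value)
    override fun encodeInt(value: Int) = out.writeInt(value)
    override fun encodeLong(value: Long) = out.writeLong(value)
    override fun encodeFloat(value: Float) = out.writeFloat(value)
    override fun encodeDouble(value: Double) = out.writeDouble(value)
    override fun encodeString(value: String) = out.writeString(value)
}

// Hypothetical example type; fields are written in declaration order.
@Serializable
data class User(val name: String, val age: Int)

fun encodeUser(user: User): ByteArray {
    val bytes = ByteArrayOutputStream()
    val avroOut = EncoderFactory.get().binaryEncoder(bytes, null)
    // The generated serializer drives the encoder; values stay primitive all the way.
    User.serializer().serialize(DirectAvroEncoder(avroOut), user)
    avroOut.flush()
    return bytes.toByteArray()
}
```

A real implementation would also need the writer schema to resolve unions (for nullable fields), enums, arrays, and maps, which this sketch leaves out on purpose.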

@thake, any comments on this? You have been on this project for a long time. This would be a big refactoring, but it could be a game changer.

Plan for making the changes:

thake commented 10 months ago

@Chuckame, thanks for the comprehensive issue for this really important topic! I also had this in mind and think this is the way to go.

I think it would be cool to get rid of the avro dependency and create an extension library that maps our types to avro types. This would open the door to multiplatform support in the core library while still supporting users that rely on avro types.

thake commented 9 months ago

Just as a little heads up, I've created a new branch https://github.com/avro-kotlin/avro4k/tree/way-to-multiplatform, which implements a PoC that directly encodes to avro binary without using the avro library. So far, everything looks very promising.
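As an illustration of what encoding to Avro binary without the Avro library involves, here is a small standard-library-only sketch. It is not the code from the branch. It relies on two rules from the Avro specification: int and long values are written as zigzag variable-length integers, and a string is written as its UTF-8 byte length (as a long) followed by the bytes. A record is just its fields written one after another in schema order. `writeAvroLong`, `writeAvroString`, `User`, and `encodeUser` are hypothetical names.

```kotlin
import java.io.ByteArrayOutputStream

// Avro int/long: zigzag-encode, then emit 7 bits per byte, low bits first,
// setting the high bit on every byte except the last.
fun ByteArrayOutputStream.writeAvroLong(value: Long) {
    var n = (value shl 1) xor (value shr 63)
    while ((n and 0x7FL.inv()) != 0L) {
        write(((n and 0x7FL) or 0x80L).toInt())
        n = n ushr 7
    }
    write(n.toInt())
}

// Avro string: UTF-8 byte length as a long, followed by the bytes themselves.
fun ByteArrayOutputStream.writeAvroString(value: String) {
    val utf8 = value.toByteArray(Charsets.UTF_8)
    writeAvroLong(utf8.size.toLong())
    write(utf8, 0, utf8.size)
}

// Hypothetical record with the schema fields (name: string, age: int).
data class User(val name: String, val age: Int)

// A record is the concatenation of its fields; no names or tags are written.
fun encodeUser(user: User): ByteArray {
    val out = ByteArrayOutputStream()
    out.writeAvroString(user.name)
    out.writeAvroLong(user.age.toLong()) // int uses the same zigzag varint format
    return out.toByteArray()
}

fun main() {
    val bytes = encodeUser(User("ab", 3))
    println(bytes.joinToString(" ") { "%02x".format(it) }) // prints "04 61 62 06"
}
```

Because the standard-library calls used here are JVM-specific only in `ByteArrayOutputStream`, the same logic could be ported to a multiplatform byte sink, which is the direction the thread is aiming for.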

I've created a small benchmark that compares the speed of Avro4k (old), Avro4kDirect (new), and Jackson. Here are the results:

Benchmark                       Mode  Cnt       Score       Error  Units
Avro4kBenchmark.clients        thrpt    3   29557,032 ±  3002,497  ops/s
Avro4kBenchmark.users          thrpt    3  290324,243 ± 23988,233  ops/s
Avro4kDirectBenchmark.clients  thrpt    3  583662,180 ± 68148,244  ops/s
Avro4kDirectBenchmark.users    thrpt    3  425400,322 ± 29352,531  ops/s
JacksonBenchmark.clients       thrpt    3  239120,408 ± 22238,596  ops/s
JacksonBenchmark.users         thrpt    3  304450,006 ± 63307,328  ops/s

I will publish the code for the benchmark soon.

Open stuff for the branch:

Chuckame commented 9 months ago

So cool! Please wait for my refactoring to be merged, since it should improve a lot of little things 👼, and mostly the encoder/decoder architecture, which has been simplified.

Chuckame commented 9 months ago

Can you create the PR as a draft so we can easily open discussions? Done!

thake commented 9 months ago

The sources of the benchmark used for the table above can now be found at https://github.com/avro-kotlin/kotlin-avro-benchmark.

Chuckame commented 9 months ago

Can you add a bit of polymorphism? That is, use the sealed root type as the serializer descriptor for avro4k, e.g. Avro.decode(RootType.serializer(), childInstance), so that it has to look up the right descriptor for the implementation during decoding. I think Jackson does the same, where you can force the root type to be used as the root serializer.

thake commented 9 months ago

I don't actually get what you mean, sry. Can you provide a PR?

Chuckame commented 9 months ago

@thake Just a general comment on the recent work on avro4k: we should talk about the changes we want to make in the codebase and synchronise. As I already said, I reworked the whole codebase, but I can see that you are also moving a lot of things around (tests, classes). I'm just afraid of rebases and duplicated work. Are you available in the next few days to meet (Discord, Zoom, Google Meet, ...) to clarify things and maybe draw up a roadmap?

Chuckame commented 2 months ago

#186 introduced encodeToByteArray and decodeFromByteArray, so we are now able to implement direct binary encoding without changing the API.