confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0

Dotnet serializers could be written to be much more efficient with higher throughput #1844

Open yinzara opened 2 years ago

yinzara commented 2 years ago

Description

Right now, all three built-in .NET serializers (ProtobufSerializer, AvroSerializer, and JsonSerializer) implement SerializeAsync in a similar fashion: they block on a semaphore while calling schema registry to retrieve/register the schema for the message's subject, then use the result to serialize the message.
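To make that concrete, here is a rough condensation of the pattern (not the actual library source; `GetSubjectName`, `WriteFramedPayload`, `schemaText`, and the field names are placeholders for illustration):

```csharp
// Rough sketch of the per-message pattern described above; names are
// illustrative, not the library's actual implementation.
private readonly SemaphoreSlim serializeMutex = new SemaphoreSlim(1);
private int schemaId;

public async Task<byte[]> SerializeAsync(T value, SerializationContext context)
{
    string subject = GetSubjectName(context.Topic);  // e.g. "<topic>-value"

    await serializeMutex.WaitAsync().ConfigureAwait(false);
    try
    {
        // A registry round trip (or at best a client-side cache lookup)
        // happens under the lock for every single message.
        schemaId = await schemaRegistryClient
            .RegisterSchemaAsync(subject, schemaText)
            .ConfigureAwait(false);
    }
    finally
    {
        serializeMutex.Release();
    }

    // Magic byte + schema id + serialized body.
    return WriteFramedPayload(schemaId, value);
}
```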

Additionally, all three are classes that take a single generic argument indicating the type of message/object they are supposed to serialize. This implies that each instance is intended to serialize a single type of message, since the type parameter is required.

However, when you read the actual code, this whole "semaphore locking for schema registry" assumes we have to resolve the "schema id" for every message, when in actuality we should only need to do it once. It's not like the "schema id" can change within a single instance anyway, right? You'd have to create a new instance for that.

This means we're unnecessarily locking on a semaphore and calling schema registry with every message serialization (and yes, I know CachedSchemaRegistryClient makes much of this cheaper) when we could be skipping it for most messages. If we kept the last "subject" in a field on the class and verified that the next message's "subject" matches and the "schema id" is not null, there would be no need to take the semaphore and call schema registry: we can assume we've already done that.

You can still support the generic case, i.e. "I want to use a ProtobufSerializer<object>": the check that the subject is unchanged will fail for each message, and the serializer will fall back to the current behavior of calling schema registry to get the "schema id".
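Concretely, the fast path could look something like this (a sketch inside a hypothetical serializer class; `lastSubject` / `lastSchemaId` are fields I'm inventing for illustration, and the concurrency handling is deliberately glossed over):

```csharp
// Sketch of the proposed fast path; lastSubject / lastSchemaId are
// hypothetical fields. A real implementation would need a volatile read
// or Interlocked exchange to be safe under concurrent ProduceAsync calls.
private string lastSubject;
private int? lastSchemaId;

public async Task<byte[]> SerializeAsync(T value, SerializationContext context)
{
    string subject = GetSubjectName(context.Topic);

    // Fast path: same subject as last time and schema id already known,
    // so skip the semaphore and the registry entirely.
    if (subject == lastSubject && lastSchemaId.HasValue)
        return WriteFramedPayload(lastSchemaId.Value, value);

    // Slow path: first message for this subject, or a ProtobufSerializer<object>
    // style usage where the subject keeps changing; current behavior.
    await serializeMutex.WaitAsync().ConfigureAwait(false);
    try
    {
        lastSchemaId = await schemaRegistryClient
            .RegisterSchemaAsync(subject, schemaText)
            .ConfigureAwait(false);
        lastSubject = subject;
    }
    finally
    {
        serializeMutex.Release();
    }

    return WriteFramedPayload(lastSchemaId.Value, value);
}
```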

This would make all the serializers much more efficient for every message after the first. In fact, I could see this being useful for the serializer implementations in the other language clients as well.

I only bring this up because publish throughput, and limiting the overhead of each publication, should be a tenet of any library that publishes Kafka messages. Kafka is meant to be a low-latency, high-throughput messaging system, and this per-message schema-management overhead in the serializers is unnecessary.

Thoughts?

mhowlett commented 2 years ago

Good chance you are right, but I need to work through it. We also want to make the framework more efficient generally (e.g. have an option to use Span<T>); that work was cut from the original scope in order to get something out. I'm also aware that somewhere along the line ProduceAsync started to have a significant performance regression (Produce is still fast though). [edit: that, in fact, turned out to be an issue in the benchmark test code]

Putting time into this is just an issue of prioritization.

RagingKore commented 2 years ago

@mhowlett if the community can help in any way, please create some issues with a specific tag.

The JSON serializer in particular could have a System.Text.Json alternative that would be much faster and more efficient.

mhowlett commented 2 years ago

At the time the JSON serdes were written, there was no real way to do schema validation with System.Text.Json. I'm not sure if that has changed.

douglasg14b commented 2 years ago

What if we don't care about schema validation...?

KalleOlaviNiemitalo commented 2 years ago

It's not like the "schema id" can change within a single instance anyway, right?

With AvroSerializer\<GenericRecord\> or AvroSerializer\<ISpecificRecord\>, each record instance can have a different schema.

With useLatestVersion == true, I suppose the serialiser should re-check the schema registry when the subject name changes (e.g. if the same serialiser instance is used by multiple Kafka producer instances with different Kafka topics), and also every now and then, in case the producer is long-running and a newer schema has been registered. I think a fast path can be added, but it's not as simple as always keeping the schema ID from the first use.
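For example, something like a subject-keyed cache with a time-to-live would cover both cases (a sketch only; the dictionary shape and the one-minute TTL are arbitrary illustrative choices):

```csharp
// Sketch: a subject-keyed fast path with a TTL, so a serialiser configured
// with useLatestVersion == true still picks up newly registered schemas
// and handles subject changes. The TTL value is arbitrary.
private readonly ConcurrentDictionary<string, (int SchemaId, DateTime FetchedAt)> bySubject =
    new ConcurrentDictionary<string, (int SchemaId, DateTime FetchedAt)>();
private static readonly TimeSpan Ttl = TimeSpan.FromMinutes(1);

private async Task<int> GetSchemaIdAsync(string subject)
{
    if (bySubject.TryGetValue(subject, out var entry) &&
        DateTime.UtcNow - entry.FetchedAt < Ttl)
    {
        return entry.SchemaId;  // fast path: no lock, no registry call
    }

    var latest = await schemaRegistryClient
        .GetLatestSchemaAsync(subject)
        .ConfigureAwait(false);

    bySubject[subject] = (latest.Id, DateTime.UtcNow);
    return latest.Id;
}
```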

mhowlett commented 2 years ago

What if we don't care about schema validation...?

Then it's easy to just use System.Text.Json pre-serialization / post-deserialization.
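For instance, a minimal IAsyncSerializer<T> that skips the registry and validation entirely (a sketch, not anything shipped in the library; note the payload is plain UTF-8 JSON without the schema-id framing, so it won't interoperate with the registry-aware deserializers):

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Confluent.Kafka;

// Minimal sketch: plain System.Text.Json, no schema registry, no validation.
public class PlainJsonSerializer<T> : IAsyncSerializer<T>
{
    private static readonly JsonSerializerOptions Options =
        new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };

    public Task<byte[]> SerializeAsync(T data, SerializationContext context) =>
        Task.FromResult(JsonSerializer.SerializeToUtf8Bytes(data, Options));
}
```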

pmgreg3 commented 5 months ago

I'm also aware that somewhere along the line ProduceAsync started to have a significant performance regression (Produce is still fast though). [edit: that, in fact, turned out to be an issue in the benchmark test code]

@mhowlett, when you mention the benchmark test code, is it published anywhere? It would be valuable to see the performance benchmarks of the .NET library that folks are building on.