Non-sequential encoding

Chuckame commented 4 months ago

What is your use-case and why do you need this feature? Currently, a decoder is able of retrieving the element index to decode, that is really powerful when we want to decode encoded data in a different order than the current element order.

Now, if we need to encode the elements in a different order than the descriptor's one, then we have to make complex implementation to buffer the encoded data, by buffering the encoded data and then really encoding it using endStructure.

For sure it works, but it feels like kotlinx-serialization is missing a part of the sequentiality feature, where we are able to select if we want sequential decoding or not, but not for the encoding part.

Describe the solution you'd like By looking at the generated code made by the serialization plugin, it seems quite "simple" to implement non-sequential encoding as exactly the same way as decoding : a method encodeSequentially that returns true by default for backward compatibility.

When this method returns false, a method getElementIndexToEncode is called unless it returns -1 (encoding done). For each matching index, it encodes the corresponding element's value.

Side note: I've seen some issues about a way to reorder fields, rejected because this job is "do-able" with the current status of the library. But why decodeSequentially and decodeElementIndex exists as it is also still possible to implement in sequential-only way?

For sure I know it's also a feature request for the serialization plugin, which I'll do it.

To finish, having this feature would simplify all the encoders that needs to reorder fields or skip fields, without implementing complex buffering or transformers. It would also make native the AbstractEncoder.encodeElement(descriptor, index): Boolean natively handled as the non-encoded element index wouldn't be returned at all (also a little more performant I suppose). For the last drop-the-mic argument, it will make the sequential mode "parallel" as it will be available in encoding and decoding 🚀

Chuckame commented 3 months ago

Hello, I've seen many issues handled/answered, but not this one, can you have a look on it? 😇

pdvrieze commented 3 months ago

@Chuckame There are two separate aspects here that you seem to confuse:

The ability for a serializer to serialize out of order: This should be supported already by most formats, there is no need to add features to CompositeEncoder to add this capability (serializer driven reordering is already possible). Well-written encoders should not assume any specific order for elements to be written.
The ability to reorder elements with some annotations, or for format specific reasons (e.g. XML attributes come before member tags). Given the current architecture it seems that this can only be a request from the encoder to the serializer (as for compatibility the old versions need to stay supported). This ordering would be format-driven, not serializer driven.

Then decodeSequentially is an optimization that deserializers can use to simplify decoding where this is also supported by the format. This is fundamentally still driven by the deserializer (although it needs to be supported by the format). If you were to have an analogue for encoding this would be the current behaviour (the burden of complexity is in the formats, not the serializers). Fundamentally decodeSequentially is a hook for performance optimization.

Some thoughts:

For the close future the library must retain support for serializers that do not allow reordering. (thus removing one of the use cases where it is required for the format).
The implementation must be inside the serializer (and is more complex). In particular it would need to be able to enquire which element to write next - do we then want the serializer to keep track of that as well
If the encoder does the reordering itself, this just means retaining the element value and serializer and then serializing it later. (there is some limited overhead, but not a lot)
Separately it may be an option for reordering to be done at the time of generating the serializer (but this would still not support format specific ordering)

Overall, I don't think this feature would actually be able to achieve much wins. Formats will still need to be able to handle reordering (where required or desirable), but every serializer that wants to support reordering would need to be updated to support it explicitly.

An alternative that might work better is to provide a base class for compositeEncoders that provides the reordering functionality for multiple formats (this would still need some format-specific hooks). But again, such functionality is more auxiliary than that it needs to be in the core library.

Chuckame commented 3 months ago

It seems you misunderstood (I surely misexplained 😄).

To summarize, currently we are able to read out of order (serialized order different than kotlin data class elements order) natively thanks to decodeSequentially set to false, but we aren't able of encoding in another order than the kotlin data class elements order. Many formats needs to reorder the elements when encoding or decoding (XML tags as you said, avro fields, json ordering in alphabetical order,...).

Why not provide natively a way to encode in a different order? This would be a big win, also regarding performances especially in serialization contexts where we expect high performances.

So if this is not a win as you said, why kotlinx-serialization already have a non sequential decoding?

Regarding the bigger encoding complexity, yes it will be inside the serializer, but it is exactly the same complexity as the non sequential decoding

Chuckame commented 3 months ago

In the close future, removing the Non-sequential feature is for me a big mistake as a serialization layer must allow the highest possible performances. By removing it, yes the complex logic will be inside the encoders implementation, but it will force them to make a 2-steps encoding and decoding to adapt the kotlin data classes to the expected serial representations.

Even in protobuf, unordered encoding is possible as each field is prefixed with its field number. With only sequential decoding, you are not able of deserializing unordered raw without using a map (or similar) to put in the good elements order.

Scheme formats needs reordering to allow performant 1-step serialization in case of schema changes (that is the norm in big companies)

pdvrieze commented 3 months ago

The reason for nonSequential decoding is because the data may not be presented in the order in a serializer (serialized datatype). If you don't want deserialization to be very brittle it is needed to support reading out of order. Hence out of order deserialization is the default. This can not be done just by the decoder as the child deserializer is needed to properly decode the value (and child deserializers are not exposed in an indexed way). On the other hand serialization is controlled by the data class, and the format can intercept it. By default formats will just serialize in the order presented by the serializer (generated or custom). For the fastest performance that is the way to go. Possibly using annotations that support reordering by the generated serializer (rather than declaration order as now).

The other option is to do reordering in the CompositeEncoder. This can be done fairly fast:

On creation the CompositeEncoder allocates an array for the values (based upon the descriptor and count)
The encoder also creates/looks up a reorder array (mapping old indices to new) - this can be cached for speed.
For each encodeXXXX call, this will be recorded as serializer/value pair in the value array (using the mapping to change the index) - for primitives PrimitiveSerializers can be used as primitive
on endStructure the encoder serializes all values that were set. It is important to consider that this reordering will have an overhead, but this will always be present.

Alternative 1 is as follows:

The compositeEncoder is created for the given serializer that supports reordering, (a beginReorderableStructure function may do this).
The compositeEncoder has a reorder map (like the other one already present)
For each element, the compositeEncoder calls serializeElement(compositeEncoder, index, structValue)
The serializer then uses index and structValue to call `compositeEncoder.encodeSerializableElement(descriptor, index, elementSerializer, structValue.elementValue)

Alternative 2 (not requiring SerializationStrategy change) is as follows:

the serializer calls beginStructure (and receives a CompositeEncoder)
the serializer calls isReorderingRequired on the composite encoder
in a loop the serializer calls nextElementToEncode until it returns a negative value (any >=0 value is the index to encode)
the serializer uses the retrieved index to encode it as it would normally do (note that it is permissible to skip the encoding if the value would normally not be encoded - ie. it has the default value)

Of the two alternatives, alternative 2 would be the most feasible to be implemented, but would still:

require an update to the generated serializer for the serializer to support it (also true for alternative 1)
require a different CompositeEncoder instance per structure (depending on the format this may be needed anyway)
require the serializer to also support "sequential" encoding
require the CompositeEncoder to support encoder-only reordering (only in those cases where reordering is mandatory - eg. XML attributes occurring before child tags)
require the compositeEncoder to somehow determine the desired order and how this maps to the original order (this can be cached and when stored as mapping array has O(1) lookup cost)
is only needed where it is the format/encoder that determines the order (if it is the serializer, no additional support is needed)
require the additional overhead in each generated serializer
have marginal efficiency benefits (the encoder internal strategy will need to use the primitiveSerializers - likely inlining will apply here)

My conclusion is that while technically possible, there is a significant overhead to even alternative 2, and it does not resolve mandatory reordering (in which case the old way still needs support so reordering is now done in two separate ways). Any efficiency gains are going to be marginal - the most efficient still being to have serializers order the elements by themselves. The main gain is found in memory, and not not needing to duplicate the encoded elements - the cost is two references (serializer, valueRef) and the actual value (in case of primitives) - only really an issue with deeply nested hierarchies of large classes.

As to decoding, the way this works in the generated deserializer is that it has a field for each element of the class, and then once decoding the elements is complete, calls a special (deserialization) constructor with those field values as parameters (after checking they are actually set). decodeSequentially does have an advantage in that it means the deserializer doesn't need to repeatedly call decodeElementIndex. For encoding, that is already the default.

sandwwraith commented 3 months ago

I get the idea, but I'm not sure how much performance gain it would actually provide. See, to provide a serializaiton order, you anyway need to: 1. Analyze SerialDescriptor in beginStructure. 2. Build some internal representation of order (e.g. IntArray with indices in your required order). Then, you need to also keep track of your structure to return the next index from getElementIndexToEncode. So you won't avoid the need to upkeep some order tracking in your format implementation at all. You probably will escape the need to write data to the internal buffer/intermediate data structure before sending it down the sink. Depending on this intermediate data structure, this can be an improvement or can be totally insignificant. I tend to think that unless it is a super-fast binary format for whom joining buffers may be problematic, it will be more on the latter side. As far as I remember, for e.g. XML we still want to build some intermediate representation even if all properties are ordered as we want.

Besides, there are also concerns about an increase in compiled bytecode size (which is quite big already, to be honest), and the fact the large methods are badly compiled by JIT. Also, formats rarely make use of decodeSequentially, so that part may not be needed as well.

Chuckame commented 3 months ago

As always, @sandwwraith @pdvrieze thanks for the constructive and complete feedback.

The major performance issue where about serializing unordered binaries, so 2 solutions: buffering the field values and then serialize them in the good order in endStructure, or serializing values in temporary buffers, and then write the buffers in the good order.

Both of the solutions seems to be performant, but will always need arrays allocation for buffering.

If you don't want to add this feature, I can try to provide a CompositeEncoder for reordering if you want ? (instead of changing the compiler generated code).

sandwwraith commented 3 months ago

or serializing values in temporary buffers and then writing the buffers in good order.

This seems to be a more viable solution. Storing field values alone will lead to boxing in most cases unless you implement separate storage for each primitive type. In the meantime, kotlinx-io/okio implement zero-copying buffer joining. It is still sub-optimal, since we would have a lot of small buffers instead of a one large, but it may be faster than copying buffer every time.

Chuckame commented 3 months ago

And what do you think about storing the "encoding action" instead of storing the field value or a buffer ? That way we do not depend on the low level buffers or serialization layer. It also does not auto box, but I don't really know the performance impact of this compared to encoding to buffer then reordering the buffers. Here is an example of what I'm doing to reorder fields when encoding:

it's able of reusing CompositeEncoder transparently, so there is only one encoder which encodes a data class

EDIT: Did a PR to have the concrete code

pdvrieze commented 3 months ago

When you do this, you don't want to inherit AbstractEncoder as it doesn't make much sense reordering to be done on Encoder (that is for single values, not for structures). This would also make that bufferEncoding can have the following signature: protected inline fun bufferEncoding(elementIndex: Int, CompositeEncoder.() -> Unit) with usage like:

override fun encodeStringElement(index:Int, value: String): Unit = bufferEncoding(index) {
    encodeStringElement(index, value)
}

There should be no need to much about with generics, but it would be able to work. As to the encoder used in endStructure you would need to choose to either reuse the current one (with a state), or to have regular composite encoder (possibly passed as constructor parameter). It is probably best to try to measure what works best.

Chuckame commented 3 months ago

See the pr, the generic T is not here anymore

Kotlin / kotlinx.serialization

Non-sequential encoding #2668