Kotlin / kotlinx.serialization

Kotlin multiplatform / multi-format serialization
Apache License 2.0
5.41k stars 620 forks source link

Boxed primitives for value classes in structures #2738

Open Chuckame opened 4 months ago

Chuckame commented 4 months ago

Describe the bug Given a serializable data class having a value class wrapping a primitive type, the field is serialized using .encodeSerializableElement(serialDesc, 0, (SerializationStrategy)myData.$serializer.INSTANCE, myData.box-impl(self.value)) instead of encodeInlineElement(serialDesc, 0).encodeInt(self.value), forcing primitive values to be boxed, and also doesn't follow the expected as said in the docs:

    /**
     * Namely, for the `@Serializable @JvmInline value class MyInt(val my: Int)`,
     * and `@Serializable class MyData(val myInt: MyInt)` the following sequence is used:
     * ```
     * thisEncoder.encodeInlineElement(MyData.serializer.descriptor, 0).encodeInt(my)
     * ```
     *
     * This method provides an opportunity for the optimization to avoid boxing of a carried value [...]
     * ```
     */
    public fun encodeInlineElement(
        descriptor: SerialDescriptor,
        index: Int
    ): Encoder

To Reproduce Check the bytecode generated from this code:

    @Serializable
    data class Structure(
        val primitiveField: BooleanValue,
    )

    @Serializable
    @JvmInline
    value class BooleanValue(val value: Boolean)

And here is the generated serializer (only the interesting part) retrieved using Kotlin bytecode IDEA tooling:

output.encodeSerializableElement(serialDesc, 0, (SerializationStrategy)ReproduceBug.BooleanValue.$serializer.INSTANCE, ReproduceBug.BooleanValue.box-impl(self.primitiveField));

Expected behavior Should call encodeInlineElement and generate the following (not sure for the nullable value class' field):

output.encodeInlineElement(serialDesc, 0, (SerializationStrategy)ReproduceBug.BooleanValue.$serializer.INSTANCE).encodeBoolean(self.primitiveField);

Environment

EDIT: shortened the description

Chuckame commented 1 month ago

Hello there, I tried to understand where could be the issue, but the plugin code is quite complex.

I think the issue is that the value class is not handled as an inlined element. I suppose the issue is around here https://github.com/JetBrains/kotlin/blob/f9de685b2feebf5e7021b7e769ae5751cde7dc46/plugins/kotlinx-serialization/kotlinx-serialization.backend/src/org/jetbrains/kotlinx/serialization/compiler/backend/ir/SerializerIrGenerator.kt#L266

Do you have any idea on how to fix it ?

sandwwraith commented 1 month ago

Yes, I've re-checked the code and it is indeed is not generated. Likely you can change functions being called in the formEncodeDecodePropertyCall or serializeAllProperties.

JakeWharton commented 1 month ago

We also experienced this performance problem. It gets even worse if you use value classes around other value classes (such as an Id wrapping a UInt) as the boxing seems to become exponential.

The idea that the library should unconditionally unwrap value classes is surprising, though. Is that only done if there is no custom serializer? What if the type is in another compilation unit and its serializer could change without recompilation?

I was coming to ask for some kind of opt-in to the unwrapping behavior when I discovered this issue.

For now we've migrated away from unsigned types and manually unwrap/re-wrap inside the carrier types.

e.g.,

@Poko
@Serializable
public class Create private constructor(
  @SerialName("id")
  private val _id: Int,
  @SerialName("tag")
  private val _tag: Int,
) : Change {
  override val id: Id get() = Id(_id)
  public val tag: WidgetTag get() = WidgetTag(_tag)

  public companion object {
    public operator fun invoke(
      id: Id,
      tag: WidgetTag,
    ): Create = Create(id.value, tag.value)
  }
}

This still affects toString() and discoverability of invoke is abysmal, but for us no users interact with this layer so it's an acceptable trade-off.

Chuckame commented 1 month ago

@JakeWharton custom serializer = on your own, so you are the owner of how to serialize, using directly encodeInt(42), or using object serializers like Int.serializer().serialize(42) which induce auto-boxing.

[...] its serializer could change without recompilation?

You missed the purpose of kotlinx-serialization plug-in when annotating with @Serializable: generate the according serializers code. So if no recompilation, no change in the generated serializers.

But still, it's interesting that you experienced big performance issues and found a "solution".


Anyway, thanks @sandwwraith, that could be an interesting day to dive a bit in the serialization plugin, I'll try to play with formEncodeDecodePropertyCall or serializeAllProperties as you said.

JakeWharton commented 1 month ago

There's so much to unpack in your comment.

Custom serializers do not mean you are "on your own". Switching to and/or from a custom serializer on the type definition from a separate compilation unit should absolutely change the behavior of callers without recompilation. This fact should immediately preclude the strategy of the plugin automatically unwrapping on its own.

Moreover, the backing property is not necessarily public nor would its type necessarily be intrinsically representable in the serialization format. Removing the ability for a value class to dictate its own serialized form despite being required to be annotated as @Serializable is a ridiculous deviation in ones mental model on the behavior of the plugin-generated code.

But sure, let's say we want to completely ignore this fact and split the behavior of the library for value classes. There are so many implications to this:

  1. Why am I required to annotate the value class with @Serializable if it is going to be unwrapped?
  2. Migrating an identity class within my compilation unit to a value class can silently cause changes in the serialized form.
  3. Migrating an identity class in a library to a value class will silently change the serialized form only when downstream compilation units recompile. This means a runtime classpath update of the library will not cause a change, but a recompilation will. Yuck.
  4. The behavior of stdlib value classes such as UInt explicitly violate this behavior. They need changed to serialize the signed representation of the two's compliment bits rather than as unsigned bits. And if you argue that UInt should not behave that way, you're now not only arguing that we should split traditional classes vs. value classes but also that there's a carve-out of a subset of the stdlib's value classes to make them serialize like regular classes and that mechanism is not available for your types.

The behavior also completely breaks with the presumed forthcoming concept of multi-property value classes. You could possibly try to argue that the behavior should only be for @JvmInline-annotated value classes, but then that means the serialization behavior changes depending on which compiler backend you are using for the current compilation which is the worst of all worlds.

The library absolutely needs the ability to serialize value classes, but there are multiple axes to the behavior which are not being adequately handled today.

First, the unwrapping of a value class to directly serialize a single, public backing property should be something the enclosing type can choose to do on a per-property basis and it should only work on value classes with a single, public backing property without it needing to be marked as @Serializable.

Second, value classes should be able to define their own serialized form using the normal @Serializable annotation and its features. The default can even be to unwrap to the backing property (assuming there's only one) and will now work if the backing property is not public.

Both of these behaviors would allow custom value class representation without the enclosing class to do any work. It will also allow multi-property value classes to seamlessly integrate into the system with the only new concept being that its default representation is an object with multiple key-value pairs rather than unwrapped to its single-property scalar value.

Additionally, these two behaviors also retain the current behavior of the library making it a compatible migration, unlike the current proposed solution to this bug to skip the serializer.

Finally, as a bonus, and the superior solution to this bug in my opinion: it would be nice if serialization was specialized to the primitive storage types to avoid boxing in the case of delegating to the value class's serializer (similar to how the stdlib specializes things like iterators for primitives). I think it's unclear if this is viable strategy long-term, but it definitely solves the case of user value classes wrapping Int, UInt wrapping Int, and user value classes wrapping UInt.

At this point, this should probably be its own bug...