dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[API Proposal]: Introduce highly-efficient binary serialization abstraction #77875

Open geeknoid opened 1 year ago

geeknoid commented 1 year ago

Background and motivation

Binary object serialization is a critical feature of modern software, and the .NET ecosystem includes a number of different serialization models (protobuf, flat buffers, msgpack, etc). The variety of serialization models makes it difficult to create general-purpose abstractions (such as caches) which are independent of the specifics of the serialization format.

I propose introducing a base common abstraction to unify these different serialization models.

We've implemented this interface in our systems and it works rather well. We currently support protobuf, protobuf-net, and FlatBuffers serialization using this interface, which we leverage in our caching infra. We haven't done msgpack yet, but it would fit right in as well.

In addition to the basic interface, there would also be implementations of this interface for the different formats (ProtobufSerializer, ProtobufNetSerializer, MsgPackSerializer).

API Proposal

namespace System.Buffers;

/// <summary>
/// Defines an efficient format-independent model for binary serialization and deserialization.
/// </summary>
public interface IBinarySerializer
{
    /// <summary>
    /// Serializes an object to binary format.
    /// </summary>
    /// <typeparam name="T">The type of the object to serialize.</typeparam>
    /// <param name="value">The object to serialize.</param>
    /// <param name="destination">Where to write the serialized data.</param>
    public void Serialize<T>(T value, IBufferWriter<byte> destination)
        where T : notnull;

    /// <summary>
    /// Deserializes an object from binary format.
    /// </summary>
    /// <typeparam name="T">The type of the object to deserialize.</typeparam>
    /// <param name="data">The buffer of serialized data.</param>
    /// <returns>The deserialized object.</returns>
    public T Deserialize<T>(ReadOnlyMemory<byte> data)
        where T : notnull, new();
}

API Usage

Classic IBufferWriter model:

[ProtoContract]
internal sealed class CustomerPBN
{
    [ProtoMember(1)]
    public string? FirstName { get; set; }

    [ProtoMember(2)]
    public string? LastName { get; set; }

    [ProtoMember(3)]
    public string? Address { get; set; }

    [ProtoMember(4)]
    public int ApartmentNumber { get; set; }
}

internal static class Program
{
    public static void Test(IBinarySerializer serializer)
    {
        var originalCustomer = new CustomerPBN
        {
            FirstName = "Alfred",
            LastName = "Hitchcock",
            Address = "123 Hollywood Blvd.",
            ApartmentNumber = 42,
        };

        // create a buffer writer to serialize into
        var serializedData = new ArrayBufferWriter<byte>();

        // turn the Customer into a buffer of bytes
        serializer.Serialize(originalCustomer, serializedData);

        // turn the buffer of bytes back into a Customer object
        var deserializedCustomer = serializer.Deserialize<CustomerPBN>(serializedData.WrittenMemory);
    }
}
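For illustration, here is a minimal sketch of what a concrete implementation of the proposed interface might look like on top of protobuf-net. The class name `ProtobufNetSerializer` is one of the hypothetical names from the proposal, not an existing type, and this assumes protobuf-net 3.x's buffer-based `Serialize`/`Deserialize` overloads:

```csharp
using System;
using System.Buffers;

// Hypothetical sketch: adapting protobuf-net to the proposed interface.
public sealed class ProtobufNetSerializer : IBinarySerializer
{
    public void Serialize<T>(T value, IBufferWriter<byte> destination)
        where T : notnull
    {
        // protobuf-net 3.x can write directly into an IBufferWriter<byte>,
        // so no intermediate array is needed.
        ProtoBuf.Serializer.Serialize(destination, value);
    }

    public T Deserialize<T>(ReadOnlyMemory<byte> data)
        where T : notnull, new()
    {
        // Likewise, deserialization can read straight from the memory block.
        return ProtoBuf.Serializer.Deserialize<T>(data);
    }
}
```

Other formats would adapt similarly, with a thin wrapper translating between the library's native entry points and the buffer-based surface.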

Alternative Designs

Using classic streams for serialization would be possible, but it would be considerably less efficient than IBufferWriter-based processing pipelines. We have combined IBufferWriter-based serialization with an IBufferWriter-based compression library, which delivers very high throughput end to end.

Risks

No response

stephentoub commented 1 year ago

As noted offline, this and https://github.com/dotnet/runtime/issues/43669 overlap.

stephentoub commented 1 year ago

cc: @GrabYourPitchforks, @terrajobst

geeknoid commented 1 year ago

As noted offline, this and #43669 overlap.

Yes, since this API is based on a different model, I chose to file a separate issue. I looked at streams originally for this pattern, but found that the buffer writer abstraction is considerably more effective for high perf transformation pipelines:

Serialize -> compress -> encrypt -> push out to the network

Streams can do that too, but they tend to impose more overhead by their nature.

teo-tsirpanis commented 1 year ago

Can implementations of Deserialize hold on to the given ReadOnlyMemory after the method exits? It would be useful for random-access binary formats.

stephentoub commented 1 year ago

Yes, since this API is based on a different model, I chose to file a separate issue

If we add an abstraction for binary serialization, I don't see us adding both.

But @GrabYourPitchforks and @terrajobst should weigh in on the whole topic.

omariom commented 1 year ago

The interface assumes that the binary representation of T is laid out in contiguous memory. Is it intentional?

Clockwork-Muse commented 1 year ago

The interface assumes that the binary representation of T is laid out in contiguous memory. Is it intentional?

... contiguous layout means you can stride a stream of them to (de)serialize an entire iterable collection. If it's not contiguous, things are much harder.

terrajobst commented 1 year ago

In principle, I don't see a reason to object to this; at least intuitively, it makes sense to me to offer something like it, and it also seems it would only really add value if the abstraction is platform-provided. However, with that also comes the hurdle: we have effectively only one shot at this, so we'd better get it right.

My general stance on abstractions is:

  1. I want to understand how someone would consume the abstraction. I don't just mean code here; I'd like to understand the class of consumers so we can understand their expectations and requirements.

  2. We should make sure we have at least a few implementations of the abstraction to ensure it holds water.

What isn't entirely clear to me here is who is going to implement the abstraction: is it the type supporting its own serialization, or is it a third party that knows about both the serializer and the type? I assume it's the latter because it's more flexible. This raises the question of how a serializer can express which types it supports; the design as sketched here would imply that this is not static information.

And lastly we should think about how these abstractions gel with source generation. As sketched here it seems there is an assumption that it's runtime only, which might not be sufficient for us moving forward.

grbell-ms commented 1 year ago

Should this interface also support ReadOnlySequence<byte> for deserializing non-contiguous objects?

geeknoid commented 1 year ago

@terrajobst Good points.

The proposed design came about as part of our implementation of a high-efficiency caching pipeline. We wanted our customers to be able to store arbitrary Ts in the cache, but different customers use different serialization frameworks, so we couldn't mandate protobuf, for example. With the abstraction, our customers can use whatever serialization framework they want, and our cache doesn't care. An incoming T is turned into a bag of bytes in whatever way the customer wants.

The integration with IBufferWriter makes it possible to very efficiently compose serialization with other data transformations with minimal copying and allocations. We maintain pools of buffer writer objects such that in general we can serialize, compress, encrypt, and send to the network all without any allocations (except for whatever the netstack is doing, of course). Same thing when reading (this time except for the final allocation of the T and whatever objects it holds).

I've got working implementations of this interface for Protobuf, Protobuf.NET, and FlatBuffers, which I'd be happy to share. Adding a MessagePack implementation should be trivial and would complete the set of pri 0 formats.

hez2010 commented 1 year ago

I don't think we should use ProtoContract and ProtoMember for a general binary serializer. Instead, we should use some neutral names such as reusing the Serializable and introducing a new SerializableMember.

davidfowl commented 1 year ago

Shouldn't this also support async serialization? Is the assumption that everything is buffered in memory? I have other questions about how we would make this trimmable/AOT friendly but maybe the ideas from JSON can be used with this model to make that happen. Either way, I'd love to see an associated strawman.

Last but not least, we need a couple of binary serializer implementations to see if this abstraction works well.

geeknoid commented 1 year ago

@davidfowl Yes, the assumption is that the serialized state is in memory.

You are starting and ending with a T so at some point, the entirety of the state must be in the managed heap. The abstraction here implies that the serialization of this T must also fit in memory. In addition, the in-memory nature means that writing out to the network cannot happen until everything has been fully serialized, and deserialization can't happen unless everything has been fully read in from the network.

It's a trade-off. This abstraction is simple and highly efficient for the common case of relatively small Ts. But it can bog down with huge Ts. A Stream-based model is more expensive in the majority of cases, but may consume less memory and potentially allow concurrent processing for very large Ts.

stephentoub commented 1 year ago

It's a trade-off

Why is a trade-off needed? Presumably we only expect a relatively small number of implementations, such that it's ok to increase the work required for an implementor by increasing the number of members to be implemented. But if we really want to avoid that, we could also add the additional methods as virtual, with a functional base implementation that can be overridden to make it as efficient as possible, e.g.

public abstract class BinarySerializer
{
    public abstract void Serialize<T>(T value, IBufferWriter<byte> destination) where T : notnull;
    public abstract T Deserialize<T>(ReadOnlyMemory<byte> data) where T : notnull, new();

    public virtual void Serialize<T>(T value, Stream destination) where T : notnull {...} 
    public virtual T Deserialize<T>(Stream source) where T : notnull, new() {...} 

    public virtual ValueTask SerializeAsync<T>(T value, Stream destination, CancellationToken cancellationToken = default) where T : notnull {...} 
    public virtual ValueTask<T> DeserializeAsync<T>(Stream source, CancellationToken cancellationToken = default) where T : notnull, new() {...} 
}

Also, why is there a new() constraint on the T? Isn't that limiting?

All of these designs will have additional overhead from the generic virtual/interface methods.

In addition to multiple implementations and consumers as David and Immo called out, I'd like to add they should be in multiple domains. Focusing only on the object caching scenario may cause us to miss some use cases.

And we need to make sure that whatever we do here is Native AOT / trimming friendly.
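To make the "functional base implementation" idea concrete, the Stream-based virtual methods in the sketch above could be given defaults that buffer in memory and delegate to the abstract buffer-based methods. This is only an illustration of the shape, not proposed implementation code:

```csharp
using System;
using System.Buffers;
using System.IO;

// Sketch: default Stream overloads layered on the buffer-based core.
public abstract class BinarySerializerSketch
{
    public abstract void Serialize<T>(T value, IBufferWriter<byte> destination) where T : notnull;
    public abstract T Deserialize<T>(ReadOnlyMemory<byte> data) where T : notnull, new();

    public virtual void Serialize<T>(T value, Stream destination) where T : notnull
    {
        // Serialize into an in-memory buffer, then write it out in one call.
        var buffer = new ArrayBufferWriter<byte>();
        Serialize(value, buffer);
        destination.Write(buffer.WrittenSpan);
    }

    public virtual T Deserialize<T>(Stream source) where T : notnull, new()
    {
        // Drain the stream into memory, then deserialize from the buffer.
        using var ms = new MemoryStream();
        source.CopyTo(ms);
        return Deserialize<T>(ms.GetBuffer().AsMemory(0, (int)ms.Length));
    }
}
```

A derived serializer that can stream natively would override these to avoid the intermediate buffering.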

MichalStrehovsky commented 1 year ago

From a trimmability perspective, we have two options: we either mark these interface methods as RequiresUnreferencedCode or we don't. If we don't mark them, the implementations need to be trimmable or users will get warnings about problematic code. We probably don't want to mark them as RUC. The question is whether the serializers we're looking at are already trimmable (i.e., whether they're a good test of whether this contract makes sense for trimmable implementations).

From an AOT perspective, I don't particularly like the generic virtual/generic interface methods. They're not particularly friendly for AOT.

But there's probably no good way around it besides not using the abstraction.

geeknoid commented 1 year ago

@stephentoub The reason for the new() constraint is to support the protobuf deserializer.

So the existing serializers support streaming? Protobuf, protobuf-net, MessagePack, FlatBuffers, etc.?

stephentoub commented 1 year ago

So the existing serializers support streaming? Protobuf, protobuf-net, MessagePack, FlatBuffers, etc.?

e.g. add a package reference to protobuf-net and MessagePack, and this compiles:

static T WithProtobufNet<T>(Stream stream, T value)
{
    stream.Position = 0;
    ProtoBuf.Serializer.Serialize(stream, value);
    stream.Position = 0;
    return ProtoBuf.Serializer.Deserialize<T>(stream);
}

static T WithMessagePack<T>(Stream stream, T value)
{
    stream.Position = 0;
    MessagePack.MessagePackSerializer.Serialize(stream, value);
    stream.Position = 0;
    return MessagePack.MessagePackSerializer.Deserialize<T>(stream);
}

static async ValueTask<T> WithMessagePackAsync<T>(Stream stream, T value)
{
    stream.Position = 0;
    await MessagePack.MessagePackSerializer.SerializeAsync(stream, value);
    stream.Position = 0;
    return await MessagePack.MessagePackSerializer.DeserializeAsync<T>(stream);
}
geeknoid commented 1 year ago

@stephentoub Can you compose together a few pipeline stages using streams without copying? With IBufferWriter, I can easily compose a pipeline with zero copies and direct memory access (via spans), and virtual calls only on big chunk boundaries.

stephentoub commented 1 year ago

Can you compose together a few pipeline stages using stream without copying?

What does a pipeline stage that performs a transform look like without "copying"? Typically in these scenarios you're compressing, decompressing, encrypting, decrypting, etc., all of which involve reading data, processing it, and writing the resulting processed data back out. I'm not sure what it means to avoid a copy in most such stages. If I have a compressor stream feeding into an encryption stream, I have a buffer of data that I Write to the compressor. It in turns saves the resulting data out to a buffer, which it then Write's to the encryption stream. If this were instead an IBufferWriter model, I ask the compressor for a buffer I can copy into, it then compresses that, asking the encryptor for a buffer to write into, etc. Each stage still has a buffer, it's just shifted whether it's on the input or output side. It's possible you end up with one more or less "copy" for the whole operation depending on the operation being performed and the model being used, but I don't currently see how it would be zero copies in either model nor an extra copy per stage in either model.

Note I'm not suggesting that any such abstraction if we were to ship one must only support streams; as I outlined, I'd even be ok if the abstract methods were focused solely on in-memory and the default stream-based implementations were wrappers around that. But not supporting streams, both sync and async, as part of such an abstraction would I think be a significant miss.

geeknoid commented 1 year ago

Clearly in a pipeline there is data flowing from one buffer to another, and that's true in both cases. My contention is that in the case of a stream, there will be an extra non-transforming copy involved at every stage in the pipeline.

IBufferWriter pipeline:

Stream pipeline, version 1:

Stream pipeline, version 2:

The first example I showed is entirely span-based. It doesn't induce additional copies, doesn't trigger virtual calls on streams, and avoids going through the async machinery.

There's a good reason for IBufferWriter to exist.
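As a concrete illustration of the span-based composition described above, a transforming stage can itself implement IBufferWriter<byte> and write its transformed output directly into the next stage's buffer. This is a hypothetical toy stage (ASCII uppercasing standing in for compression/encryption), not code from our pipeline:

```csharp
using System;
using System.Buffers;

// Sketch: an IBufferWriter<byte> pipeline stage. The producer writes into
// this stage's scratch buffer; Advance transforms those bytes straight into
// the next stage's buffer, so the only copy is the transform itself.
public sealed class UppercaseAsciiStage : IBufferWriter<byte>
{
    private readonly IBufferWriter<byte> _next;
    private byte[] _scratch = new byte[4096];

    public UppercaseAsciiStage(IBufferWriter<byte> next) => _next = next;

    public Memory<byte> GetMemory(int sizeHint = 0)
    {
        if (sizeHint > _scratch.Length)
            _scratch = new byte[sizeHint];
        return _scratch;
    }

    public Span<byte> GetSpan(int sizeHint = 0) => GetMemory(sizeHint).Span;

    public void Advance(int count)
    {
        // Transform directly into the downstream buffer via spans;
        // interface calls happen only once per chunk, not per byte.
        Span<byte> dest = _next.GetSpan(count);
        for (int i = 0; i < count; i++)
        {
            byte b = _scratch[i];
            dest[i] = (b >= (byte)'a' && b <= (byte)'z') ? (byte)(b - 32) : b;
        }
        _next.Advance(count);
    }
}
```

Stages like this chain naturally: a serializer writes into the first stage, and each stage's Advance pushes transformed chunks downstream.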

stephentoub commented 1 year ago

Stream pipeline, version 1: Stream pipeline, version 2:

Neither of those. Let's say I have:

var s = new DeflateStream(new SslStream(new NetworkStream(...), ...), ...);

There's no reading from Streams to pass data between them, as you've outlined in your version 1, as each Stream writes its output buffer directly to the next. And there's no "hundreds of calls" implying fine-grained writes, as there's a buffer at each stage just as there is in the IBufferWriter example. There are the same number of copies here as there is in the IBufferWriter example, there are virtual calls on the Stream just as there are interface calls on the IBufferWriter, and whether there's async machinery is caller's choice as to whether to use the sync or async methods on Stream.

There's a good reason for IBufferWriter to exist.

I'm not sure why you're explaining to me the benefits of a type I helped to design in a library I oversee, nor why you're arguing a position I've not argued against nor seen anyone else on this thread argue against. I've stated multiple times any abstraction we might do in this realm needs to support streams, in addition to anything we do around IBufferWriter, not instead of.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

joperezr commented 1 year ago

We are evaluating whether or not we need this for .NET 8; @geeknoid will confirm if we can instead push this to a later release.