Cysharp / MemoryPack

Zero encoding extreme performance binary serializer for C# and Unity.
MIT License
3.29k stars 193 forks source link

Support for deserialization "into" existing collection (instead of overwrite) #337

Open dusrdev opened 1 week ago

dusrdev commented 1 week ago

In order to reduce allocations of collections when deserializing, MemoryPackSerializer has a method that overwrites an existing instance. For 1 item this is fine, but for it would have been very helpful for collections that instead of overwriting or replacing them, it would serialize into the available space.

For instance, if a T[] is used there, the entire instance is replaced, regardless if the deserialized items could have been inserted from some index to some index. In some cases, the opposite would have been very efficient.

For the sake of the example, consider that I use MemoryPack to serialize a User[] in some API. In the crud operation of adding a new user, right now I have to do this:

User[] currentUsers = MemoryPack.Deserialize<User[]>(binaryData);
User[] usersAfterAddition = new User[currentUsers.Length + 1];
currentUsers.CopyTo(usersAfterAddition);
usersAfterAddition[^1] = newUser;
binaryData = MemoryPack.Serialize(usersAfterAddition);

So I am forced to allocate a secondary array with larger length just to add items. I want to be able to do this:

User[] currentUsers = new User[currentUserCount + 1];
Span<User> current = new Span(currentUsers, 0, currentUserCount);
int written = MemoryPackSerializer.Deserialize(binaryData, current);
// I could use written in case I want to advance some index, perhaps as part of buffer or whatever
// for the sake of the example I won't use ^1 and use the written instead
currentUsers[written] = newUser;
binaryData = MemoryPack.Serialize(currentUsers);

This essentially uses a destination span for the required items, it could be a part of a larger buffer (which also enables usage with BufferWriter(s), Arrays rented from pools etc... And using this functionality I completely avoid allocating an array of size n. (this is big, especially if the usage is in an API like in the example, which calls it lots of times, using an array from ArrayPool.Shared would increase the benefit immensely, and the same API would support it).

Currently, it is impossible to do from my tries, firstly because the overload of Deserialize that overwrites the instance uses a ref object? obj for the instance, and obviously span<t> cannot be cast to object. I have tried to implement a workaround using Memory, like this:

User[] currentUsers = new User[currentUserCount + 1];
Memory<User> current = new Memory(currentUsers, 0, currentUserCount);
ref object? obj = ref Unsafe.As<Memory<User>, object?>(ref current);
int written = MemoryPackSerializer.Deserialize(binaryData, ref obj);
// I could use written in case I want to advance some index, perhaps as part of buffer or whatever
// for the sake of the example I won't use ^1 and use the written instead
currentUsers[written] = newUser;
binaryData = MemoryPack.Serialize(currentUsers);

Which logically should produce similar results as Memory.Slice is also very efficient and doesn't actually change the backing array. But even after the Deserialize call, my debugger showed that the backing array was unchanged (remained only initialized without the deserialized data). And my unit tests failed.

Btw, as it currently stands, written doesn't actually tell the count of deserialized items, it just returns the length of binaryData which means if you are expecting to get 3 Users, from a byte[] of length 80 for instance, written would be 80, and trying to use it to advance indexes will throw you out of bounds. (So this is also a bug)

I know that implementing such a change would lead to thinking that it would mean a lot of overloads and code to add, but you don't really need much, supporting Span<T> would cover 99% of cases, as you can get the span out of arrays and lists and they are the most common data types for such operations.

dusrdev commented 1 week ago

I managed to find a workaround for now:

// Input
ReadOnlySpan<byte> serialized = ... // This is the serialized data
int length = ... // length of T[] before serialization
// Actual work
var state = MemoryPackReaderOptionalStatePool.Rent(null);
var reader = new MemoryPackReader(data, state);
var arr = new T[length + 1];
Span<T?> span = arr.AsSpan(0, length)!;
reader.ReadSpan(ref span);
arr[length] = new T() // the added item
// Then proceed to serialize after addition
var newSerialized = MemoryPackSerializer.Serialize(arr);

This works, and definitely better than allocating a new array just for deserialization. However, it is still suboptimal as:

By the way - BUG: as the documentation explains, the overwrite method, if receives a collection that has a .Clear() method, will instead just clear the existing instance and reuse it. I tried to leverage this to hack a different workaround, such as:

public class ICollectionWrappedArray<T> : ICollection<T> {
    private readonly T[] _arr;
    private int _index;
    public ICollectionWrapperArray(T[] arr) {
        _arr = arr;
    }
    // Empty ctor required for MemoryPack for some reason - otherwise runtime error
    public ICollectionWrapperArray() { _arr =[]; }
    public void Add(T item) {
        _arr[_index] = item;
        _index++;
     }
     public void Clear() { _index = 0; }
     // Rest of implementation doesn't really matter for this use case
}

// The instead of the span workaround above
T[] newArr = new[length + 1];
var collection = new ICollectionWrappedArray(newArr);
int written = MemoryPackSerializer.Deserialize(serializedData, ref collection);

This should have worked according to the documentation, as the collection has .Clear method, it should have just reset the index (which was 0 in any case), then just added to the array.

But, my tests failed, and after debugging I saw that it always went to the empty ctor, causing it to allocate a new array instead of using the one that was provided.