AdrianStrugala / AvroConvert

Rapid Avro serializer for C# .NET

Added ability to de/serialize from/to avro container. #119

Open wdcossey opened 9 months ago

wdcossey commented 9 months ago

Avro Container

*** I haven't had time to test everything as I only needed serialization for Azure Data Explorer

AdrianStrugala commented 9 months ago

Thank you for your contribution, I will do my best to review it today. Looks pretty nice at first glance!

wdcossey commented 9 months ago

Pushed a bug-fix and some enhancements.

AdrianStrugala commented 9 months ago

Very nice PR, thank you. Just two minor comments from my side. When you address them, I will merge the PR, write a short doc, and create the next release.

gmanvel commented 7 months ago

@AdrianStrugala @wdcossey what's the decision on this PR? In general, it looks like a deviation from the AvroConvert API, which, in my impression, follows the Newtonsoft.Json JsonConvert approach: JsonConvert.SerializeObject handles all object types, and there are no specialized methods for specific types. It would also mean that all existing clients would need to change their code to benefit from this. Instead, we could change the Serialize method to apply this approach whenever the passed object is a collection, e.g.

/// <summary>
/// Serializes given object into Avro format (including header with metadata)
/// Choosing <paramref name="codecType"/> reduces output object size
/// </summary>
public static byte[] Serialize(object obj, CodecType codecType)
{
    var schema = Schema.Create(obj);

    if (schema is ArraySchema && !obj.GetType().IsDictionary())
    {
        var items = (IEnumerable)obj;

        // Derive the item schema from the first element.
        // Note: an empty collection is not handled here.
        var enumerator = items.GetEnumerator();
        enumerator.MoveNext();
        var first = enumerator.Current;

        var itemSchema = Schema.Create(first);

        using (MemoryStream resultStream = new MemoryStream())
        {
            using (var writer = new Encoder(itemSchema, resultStream, codecType))
            {
                // Enumerate again with foreach instead of calling Reset(),
                // which iterator-based enumerators do not support.
                foreach (var item in items)
                {
                    writer.Append(item);
                }
            }

            byte[] result = resultStream.ToArray();
            return result;
        }
    }
    else
    {
        using (MemoryStream resultStream = new MemoryStream())
        {
            using (var writer = new Encoder(schema, resultStream, codecType))
            {
                writer.Append(obj);
            }
            byte[] result = resultStream.ToArray();
            return result;
        }
    }
}
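
Taking the first element to build the item schema and then writing every element needs some care: enumerators produced by C# iterator methods throw NotSupportedException from Reset(), and an empty collection has no first element at all. Below is a minimal, self-contained sketch of one way to do this in a single pass. The helper name ForEachWithFirst and its callbacks are hypothetical and not part of AvroConvert; in the real method, onFirst would presumably create the item schema and onItem would append to the writer.

```csharp
using System;
using System.Collections;

public static class FirstItemPeek
{
    // Hypothetical helper, not part of AvroConvert: walks the sequence once,
    // hands the first item to onFirst (e.g. to build the item schema),
    // then passes every item, including the first, to onItem.
    // Returns false for an empty sequence without calling either callback.
    public static bool ForEachWithFirst(IEnumerable source, Action<object> onFirst, Action<object> onItem)
    {
        var enumerator = source.GetEnumerator();
        try
        {
            if (!enumerator.MoveNext())
            {
                return false; // empty: the caller decides how to handle it
            }

            onFirst(enumerator.Current);
            do
            {
                onItem(enumerator.Current);
            }
            while (enumerator.MoveNext());

            return true;
        }
        finally
        {
            // The non-generic IEnumerator is not IDisposable, but the concrete type often is.
            (enumerator as IDisposable)?.Dispose();
        }
    }
}
```

Because the sequence is only enumerated once, this also works for lazily evaluated sequences, and the caller can fall back to the non-collection path when the method returns false.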

On the other hand, this is potentially a breaking change. While AvroConvert.Deserialize can successfully deserialize .avro files generated this way, the byte content of files generated before and after this change is not the same.

I would suggest making a decision and implementing this change in the library, as there are big performance improvements:

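| UserCount | Original Mean (ms) | Improved Mean (ms) | Mean Improvement (%) | Original Allocated (MB) | Improved Allocated (MB) | Allocation Improvement (%) |
|-----------|--------------------|--------------------|----------------------|-------------------------|-------------------------|----------------------------|
| 100       | 2.932              | 0.9109             | 68.9%                | 2.15                    | 1.57                    | 27.0%                      |
| 1000      | 12.314             | 7.8920             | 35.9%                | 19.46                   | 12.79                   | 34.3%                      |
| 10000     | 123.433            | 103.5033           | 16.1%                | 217.91                  | 151.68                  | 30.4%                      |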
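
For reference, the improvement columns appear to be the relative reduction, (original - improved) / original, expressed in percent. A small sketch that recomputes them from the measured values; the class and method names are made up for illustration:

```csharp
using System;
using System.Globalization;

public static class Improvement
{
    // Relative improvement in percent: how much smaller `improved` is than `original`.
    public static double Percent(double original, double improved)
        => (original - improved) / original * 100.0;

    // Formats the improvement with one decimal place, independent of the current culture.
    public static string Format(double original, double improved)
        => Percent(original, improved).ToString("F1", CultureInfo.InvariantCulture) + "%";
}
```

For example, Improvement.Format(2.932, 0.9109) gives "68.9%" for the mean time at 100 users, and Improvement.Format(217.91, 151.68) gives "30.4%" for the allocations at 10000 users.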

Benchmark used to compare the AvroConvert v3.4.0 NuGet package against AvroConvert.Serialize with support for serializing array items into separate blocks:

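using System.Linq;
using AutoFixture;
using BenchmarkDotNet.Attributes;
using SolTechnology.Avro;

// User and Offering are the benchmark's model classes (not shown here).
[MemoryDiagnoser]
public class AvroConvertSerializeArray
{
    // Number of users in the serialized array
    [Params(100, 1_000, 10_000)]
    public int UserCount;
    private User[] _data;

    [GlobalSetup]
    public void Setup()
    {
        // Generate test users with AutoFixture, each with 21 offerings
        Fixture fixture = new Fixture();
        _data = fixture
            .Build<User>()
            .With(u => u.Offerings, fixture.CreateMany<Offering>(21).ToList)
            .CreateMany(UserCount)
            .ToArray();
    }

    [Benchmark]
    public byte[] Serialize() => AvroConvert.Serialize(_data);
}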
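
To reproduce the numbers, the benchmark class could be started with BenchmarkDotNet's runner. This is only a sketch of a typical entry point: it assumes a console project that references BenchmarkDotNet and contains the AvroConvertSerializeArray class, and it is not taken from the PR itself.

```csharp
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main(string[] args)
    {
        // Runs every [Benchmark] method in the class once per [Params] value
        // and prints a summary table including the memory columns.
        BenchmarkRunner.Run<AvroConvertSerializeArray>();
    }
}
```

BenchmarkDotNet expects an optimized build, so the project should be run in Release configuration, for example with dotnet run -c Release.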

AdrianStrugala commented 4 months ago

Hey, I am going to implement this in a way similar to what you've suggested, Manvel. The point is that this is in fact a breaking change, so I would make it part of the v4 release. Adrian