ch-robinson / dotnet-avro

An Avro implementation for .NET
https://engineering.chrobinson.com/dotnet-avro/
MIT License
136 stars 52 forks source link

AsyncSchemaRegistryDeserializer creates new delegate for each invocation of any deserializer #314

Open fabianoliver opened 4 months ago

fabianoliver commented 4 months ago

I was going through a memory dump and noticed unusually high allocations coming from calls to AsyncSchemaRegistryDeserializer<T>.DeserializeAsync on a process running on Win10 under .NET8.

The allocated types were of type Delegate, with their call stack going DeserializeAsync -> System.Reflection.Emit.DynamicMethod.CreateDelegate(Type, Object) -> Delegate.CreateDelegatenoSecurityCheck.

It appears that every call to DeserializeAsync leads to a new delegate being created for every single deserialization call (!!). For example, set breakpoint in Delegate.CoreCLR.cs -> CreateDelegateNoSecurityCheck, and run something like

  var s = new AsyncSchemaRegistryDeserializer<Something>(...);
  await s.DeserializeAsync(somePayload, false, default);
  await s.DeserializeAsync(somePayload, false, default);
  await s.DeserializeAsync(somePayload, false, default);

CreateDelegateNoSecurityCheck will be hit (at least?) three times. It's not clear to me yet why that is happening


More self-contained example to reproduce this:

    private readonly record struct Dto;

    [Test]
    public void test()
    {
        var schema = new JsonSchemaReader().Read("{  \"name\": \"Dto\",  \"type\": \"record\",  \"fields\": []}");
        var inner = new BinaryDeserializerBuilder().BuildDelegateExpression<Dto>(schema);

        var data = new byte[] { };
        var innerCompiled = inner.Compile();

        // Breakpoint in Delegate.CreateDelegateNoSecurityCheck
        var rr = new BinaryReader(data);
        innerCompiled(ref rr);
        innerCompiled(ref rr);
    }
dstelljes commented 4 months ago

Hmm, I would guess that this is because inner LambdaExpressions aren't compiled when the outer expression is compiled. Chr.Avro generates a LambdaExpression for each record type to support recursive references. (To confirm, you should be able to verify that (de)serializers for non-record types do not call CreateDelegate.)

Given this, we might have to revisit how we build expression trees.

fabianoliver commented 4 months ago

I'm still trying to understand what .NET is doing under the hood there. I don't think it's an issue of compilation per se - in the sense that the nested lamda is compiled fine I think. I'm not even sure it'd really count as a proper bug, whether delegates are or are not cached is definitely more art than science in C# I think, with the how and ifs even having changed over language versions. So probably "just" more of a performance downside particularly noticeable in scenarios with fairly high message rates.

As for delegate creation, my understanding was that non-capturing lambdas are cached, but that doesn't seem to be happening for expressions (though I'd need to double check if the language specs say they can be cached, or must be cached).

The simplest example I could condense it down to so far is the following:

var param = Expression.Parameter(typeof(int));
var innerParam = Expression.Parameter(typeof(Action<int>));
var inner = Expression.Lambda(innerParam.Type, Expression.Block(), "test", new[] { param });
var action = Expression.Lambda<Action<int>>(Expression.Block(new[] { innerParam }, Expression.Assign(innerParam, inner), Expression.Invoke(innerParam, param)), param).Compile();

// Each invocation will trigger delegate creation
action(456);
action(567);

Interestingly, I don't observe the same behaviour when creating an equivalent function/invocation structure directly in code outside of the expression system:

Action<int> inner = _ => { };
Action<int> action= x =>
{
  Action<int> innerParam = inner;
  innerParam(x);
};

// Delegates are cached just fine
action(456);
action(567);

Even more interestingly, FastExpressionCompiler seems to be doing a better job here. If I compile any of my samples with that, delegates do seem to be cached fine - though I may have to run this through for more complex examples.

fabianoliver commented 4 months ago

After a bit more testing - FastExpressionCompiler isn't able to compile more complex lambda expressions / throws in those cases, so unfortunately doesn't seem to be a great workaround.

Thinking about this a little further, I suspect that in the current way expression trees are created, re-creating delegates might be unavoidable in some cases. Not in the simple examples above, but at least once you have types whose properties are other complex types (whether self-referential or not) - if I understand the current generators right, their lambda would reference variables from the outer scope to access their nested types' deserializers. I think that would inevitably make the lambda capture the outer parameter, and capturing lambdas are never cached. Though hard to verify at the moment given even the simple, non-capturing versions aren't cached in the default compilation path.

I think a few options in increasing order of madness that come to mind at the moment are:

Maybe the first option isn't so bad after all.. :-)

dstelljes commented 3 months ago

Thanks for the detailed writeup! A few additional thoughts:

Possibly do a lazy workaround that'd generate a simpler expression tree in case where the record is a simple type that contains only primitive fields.

Of the options you laid out, something like that is probably the most pragmatic. (A more sophisticated variation might be to walk the type graph and only do the circular reference trick for types that appear in a cycle.)

Maybe the first option isn't so bad after all.. :-)

This is more or less where we're at too. I'm going to leave this as a bug since it's an unintended consequence of how the serdes are built, but it's unlikely that we'll be able to prioritize working on it.