AsyncSchemaRegistryDeserializer creates new delegate for each invocation of any deserializer

I was going through a memory dump and noticed unusually high allocations coming from calls to AsyncSchemaRegistryDeserializer<T>.DeserializeAsync on a process running on Win10 under .NET8.

The allocated types were of type Delegate, with their call stack going DeserializeAsync -> System.Reflection.Emit.DynamicMethod.CreateDelegate(Type, Object) -> Delegate.CreateDelegatenoSecurityCheck.

It appears that every call to DeserializeAsync leads to a new delegate being created for every single deserialization call (!!). For example, set breakpoint in Delegate.CoreCLR.cs -> CreateDelegateNoSecurityCheck, and run something like

  var s = new AsyncSchemaRegistryDeserializer<Something>(...);
  await s.DeserializeAsync(somePayload, false, default);
  await s.DeserializeAsync(somePayload, false, default);
  await s.DeserializeAsync(somePayload, false, default);

CreateDelegateNoSecurityCheck will be hit (at least?) three times. It's not clear to me yet why that is happening

More self-contained example to reproduce this:

    private readonly record struct Dto;

    [Test]
    public void test()
    {
        var schema = new JsonSchemaReader().Read("{  \"name\": \"Dto\",  \"type\": \"record\",  \"fields\": []}");
        var inner = new BinaryDeserializerBuilder().BuildDelegateExpression<Dto>(schema);

        var data = new byte[] { };
        var innerCompiled = inner.Compile();

        // Breakpoint in Delegate.CreateDelegateNoSecurityCheck
        var rr = new BinaryReader(data);
        innerCompiled(ref rr);
        innerCompiled(ref rr);
    }

Hmm, I would guess that this is because inner LambdaExpressions aren't compiled when the outer expression is compiled. Chr.Avro generates a LambdaExpression for each record type to support recursive references. (To confirm, you should be able to verify that (de)serializers for non-record types do not call CreateDelegate.)

Given this, we might have to revisit how we build expression trees.

I'm still trying to understand what .NET is doing under the hood there. I don't think it's an issue of compilation per se - in the sense that the nested lamda is compiled fine I think. I'm not even sure it'd really count as a proper bug, whether delegates are or are not cached is definitely more art than science in C# I think, with the how and ifs even having changed over language versions. So probably "just" more of a performance downside particularly noticeable in scenarios with fairly high message rates.

As for delegate creation, my understanding was that non-capturing lambdas are cached, but that doesn't seem to be happening for expressions (though I'd need to double check if the language specs say they can be cached, or must be cached).

The simplest example I could condense it down to so far is the following:

var param = Expression.Parameter(typeof(int));
var innerParam = Expression.Parameter(typeof(Action<int>));
var inner = Expression.Lambda(innerParam.Type, Expression.Block(), "test", new[] { param });
var action = Expression.Lambda<Action<int>>(Expression.Block(new[] { innerParam }, Expression.Assign(innerParam, inner), Expression.Invoke(innerParam, param)), param).Compile();

// Each invocation will trigger delegate creation
action(456);
action(567);

Interestingly, I don't observe the same behaviour when creating an equivalent function/invocation structure directly in code outside of the expression system:

Action<int> inner = _ => { };
Action<int> action= x =>
{
  Action<int> innerParam = inner;
  innerParam(x);
};

// Delegates are cached just fine
action(456);
action(567);

Even more interestingly, FastExpressionCompiler seems to be doing a better job here. If I compile any of my samples with that, delegates do seem to be cached fine - though I may have to run this through for more complex examples.

After a bit more testing - FastExpressionCompiler isn't able to compile more complex lambda expressions / throws in those cases, so unfortunately doesn't seem to be a great workaround.

Thinking about this a little further, I suspect that in the current way expression trees are created, re-creating delegates might be unavoidable in some cases. Not in the simple examples above, but at least once you have types whose properties are other complex types (whether self-referential or not) - if I understand the current generators right, their lambda would reference variables from the outer scope to access their nested types' deserializers. I think that would inevitably make the lambda capture the outer parameter, and capturing lambdas are never cached. Though hard to verify at the moment given even the simple, non-capturing versions aren't cached in the default compilation path.

I think a few options in increasing order of madness that come to mind at the moment are:

Keep everything as-is. Which might be a perfectly sane choice if the alternatives are too complex / not worth the gain in performance and reduced allocs!!
Possibly do a lazy workaround that'd generate a simpler expression tree in case where the record is a simple type that contains only primitive fields. That might be the quickest win that'd probably benefit a reasonable percentage of real use cases
Try to wire up circular references in the expression tree in another way. An interesting technique for anonymous recursion is described here for example - maybe it'd similarly be able to create some variation of delegate object DeserializeRecord(Type dtoType, string recordName, ref BinaryReader reader), then have each generated record deserializer take a delegate of that signature that'd allow it to deserialize nested types without having to capture/reference something from the outer scope? Hard to say for sure though if that'd really do the trick
Instead of an expression tree, explicitly create an AnonymousType with one AnonymousMethod to deserialize each record type. Emitting IL directly sounds like a nightmare though and I'm not sure if there's much in terms of supporting libraries out there that could make it much easier.
Ditching all that; generate "proper" full source for a deserializer, and use Roslyn to compile and load up the assembly. It wouldn't get more flexible than that, and leaving behind "almost-abandoned-by-microsoft" expression trees could be nice as well. However.. pulling in those giant dependencies, plus dealing with possibly breaking changes (eg definite no for AOT or mobile devices; though not quite sure if expression trees work there either) probably makes this a non-starter anyway

Maybe the first option isn't so bad after all.. :-)

Thanks for the detailed writeup! A few additional thoughts:

Possibly do a lazy workaround that'd generate a simpler expression tree in case where the record is a simple type that contains only primitive fields.

Of the options you laid out, something like that is probably the most pragmatic. (A more sophisticated variation might be to walk the type graph and only do the circular reference trick for types that appear in a cycle.)

Maybe the first option isn't so bad after all.. :-)

This is more or less where we're at too. I'm going to leave this as a bug since it's an unintended consequence of how the serdes are built, but it's unlikely that we'll be able to prioritize working on it.

ch-robinson / dotnet-avro

AsyncSchemaRegistryDeserializer creates new delegate for each invocation of any deserializer #314