beamable / BeamableProduct

The beamable product suite including com.beamable, com.beamable.server, microservice base image, portal, the installer, and build scripts

Perf Thoughts #3498

Open cdhanna opened 2 weeks ago

cdhanna commented 2 weeks ago

This test is failing on main; specifically, the count variable is being incremented beyond 1.


public class MultiThreadedAccessTests
{
    [Test]
    public async Task SingletonOnlyGetsMadeOnce()
    {
        var builder = new DependencyBuilder();
        int count = 0;
        builder.AddSingleton<A>(_ =>
        {
            count++;
            return new A();
        });
        var provider = builder.Build();

        var tasks = Enumerable.Range(0, 10_000).Select(i => Task.Run(async () =>
        {
            await Task.Delay(1);
            provider.GetService<A>();
        }));

        await Task.WhenAll(tasks);

        Assert.That(count, Is.EqualTo(1), "the factory function should only be invoked once.");
    }

    public class A
    {
        // no-op class just for testing
    }
}

Also, looking into the DI types: they use ConcurrentDictionary, likely in an attempt to be thread-safe. However, some very simple benchmarks show that these types are costly.

[Benchmark]
public void Dict_Concurrent()
{
    var x = new ConcurrentDictionary<Type, ServiceDescriptor>();
}
[Benchmark]
public void Dict_Regular()
{
    var x = new Dictionary<Type, ServiceDescriptor>();
}

And the results (the tl;dr is that ConcurrentDictionary allocates almost 10x the memory that a regular Dictionary does!):

| Method                                | Mean        | Error     | StdDev   | Gen0   | Gen1   | Allocated |
|-------------------------------------- |------------:|----------:|---------:|-------:|-------:|----------:|
| Dict_Concurrent                       |    86.31 ns |  8.274 ns | 0.454 ns | 0.1262 |      - |     792 B |
| Dict_Regular                          |    10.65 ns |  0.630 ns | 0.035 ns | 0.0127 |      - |      80 B |

Indeed, when I ran a benchmark that only created a DependencyBuilder and then called the .Build() method to produce an empty DependencyProvider, it allocated 4.43 KB worth of memory.
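For context, that base case is essentially just this (a sketch, not the exact code; I return the provider so the work can't be optimized away, a pitfall I come back to later):

[Benchmark]
public DependencyProvider BaseCase()
{
    var builder = new DependencyBuilder();
    return builder.Build(); // empty provider, no registrations
}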

Clearly, there are some problems.

  1. the code is spamming allocations in an attempt to be thread-safe, but
  2. the code is not even thread-safe.

This caused me to go on an adventure!

I ran the above unit test, but as a benchmark instead,

| Method                    | Mean     | Error    | StdDev   | Gen0     | Gen1     | Gen2     | Allocated |
|-------------------------- |---------:|---------:|---------:|---------:|---------:|---------:|----------:|
| SingletonOnlyGetsMadeOnce | 12.10 ms | 0.519 ms | 0.028 ms | 687.5000 | 265.6250 | 171.8750 |   4.22 MB |

The goal now is to fix the two problems... I guess most importantly, the code should pass the unit test, performance be damned! But ideally, while I'm sneaking through the code, it would be nice if it allocated less memory and ran faster. I added a simple lock around the entire GetService method; it raised the mean time a bit but changed no other performance numbers... and it fixed the unit test.
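The change is as blunt as it sounds. A sketch of the shape, where the _lock field and the ResolveInternal helper are illustrative names, not the real ones:

private readonly object _lock = new object();

public T GetService<T>()
{
    // serialize all resolution, so a singleton factory can only ever run once
    lock (_lock)
    {
        return (T)ResolveInternal(typeof(T));
    }
}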

With this in mind, it seems like perhaps the goal of using the concurrent data structures isn't worth the memory allocation cost. I ripped them out (without running all the unit tests yet, so perhaps I broke something), and re-ran my base benchmark where I created an empty DependencyProvider.


| Method              | Mean     | Error    | StdDev  | Gen0   | Gen1   | Allocated |
|-------------------- |---------:|---------:|--------:|-------:|-------:|----------:|
| BaseCase_Concurrent | 668.7 ns | 80.14 ns | 4.39 ns | 0.7229 | 0.0086 |   4.43 KB |
| BaseCase_Lock       | 235.3 ns |  9.61 ns | 0.53 ns | 0.1440 |      - |     904 B |

The allocation goes from 4.43 KB to 904 B. And it follows that as less memory is allocated, less time is spent in the GC, so the new runtime is nearly a third of the old one. This feels like a massive improvement. So maybe it's worth digging in some more to investigate whether those ConcurrentDictionary types were really saving us from anything.

On a separate note, the internals of the DependencyProvider keep two caches: one for instantiated singleton objects, and one for instantiated scoped objects. This is wasteful, because you cannot add the same Type as both a singleton and a scoped service anyway, so those dictionary caches never overlap in key-space. Since they never overlap, we can use a single cache object. This saves a further 80 B of memory.

| Method               | Mean     | Error   | StdDev  | Gen0   | Allocated |
|--------------------- |---------:|--------:|--------:|-------:|----------:|
| BaseCase_SingleCache | 226.1 ns | 7.97 ns | 0.44 ns | 0.1299 |     816 B |

Actually, the same principle can be applied to the DependencyProvider's use of 3 separate dictionaries to hold the ServiceDescriptors for transients, scoped, and singletons. It is invalid to have any type overlap between the lifetimes anyway, so those 3 dictionaries can be collapsed into a single dictionary. This saves a few hundred bytes in the base case.

| Method               | Mean     | Error    | StdDev  | Gen0   | Allocated |
|----------------------|---------:|---------:|--------:|-------:|----------:|
| BaseCase_SingleDict  | 197.0 ns | 10.81 ns | 0.59 ns | 0.0942 |     592 B |
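Both collapses are mechanically the same idea; roughly this shape, using the Descriptors and InstanceCache names that appear later, though the real declarations may differ:

// before: one ServiceDescriptor dictionary per lifetime, plus separate
// singleton and scoped instance caches.
// after: one descriptor dictionary (the lifetime lives on the descriptor)
// and one instance cache, since a Type is only ever registered under a
// single lifetime at a time.
private readonly Dictionary<Type, ServiceDescriptor> Descriptors = new Dictionary<Type, ServiceDescriptor>();
private readonly Dictionary<Type, object> InstanceCache = new Dictionary<Type, object>();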

For reference, if all you do is instantiate the DependencyBuilder but don't call .Build() on it, that costs 136 B. That means there are roughly 450 B involved in the basic spin-up of a DependencyProvider. A lot of that has to do with a feature I'd love to remove from our DI scheme, called hydration... Indeed, if I run a benchmark where all I do is instantiate a DependencyProvider directly, it costs the same rough 450 B. I don't want to think about the hydration stuff for now, but the same stunt of collapsing the various dictionaries can be applied to the builder as well. It was holding 3 separate dictionaries, one per lifetime type. I collapsed them into one dictionary, and the resulting memory went down to 56 B. Re-running the base case, where the empty provider is created through the DependencyBuilder.Build() method:

| Method                     | Mean     | Error    | StdDev  | Gen0   | Allocated |
|--------------------------- |---------:|---------:|--------:|-------:|----------:|
| BaseCase_BuilderSingleDict | 183.4 ns | 10.52 ns | 0.58 ns | 0.0815 |     512 B |

The returns appear to be diminishing. I figured it would be a good time to re-run all the unit tests across cli and microservice to see if I had broken anything (side note: we should combine our solutions at this point). And lo and behold, I had broken something!! I broke 81 tests in microservice, all with roughly the same call stack,

System.ArgumentException : An item with the same key has already been added. Key: Beamable.Server.SocketRequesterContext
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at Beamable.Common.Dependencies.DependencyProvider..ctor(DependencyBuilder builder, BuildOptions options) in /Users/chrishanna/Documents/Github/BeamableProduct/cli/beamable.common/Runtime/Dependencies/DependencyProvider.cs:line 215
   at Beamable.Common.Dependencies.DependencyProvider.Fork(Action`1 configure) in /Users/chrishanna/Documents/Github/BeamableProduct/cli/beamable.common/Runtime/Dependencies/DependencyProvider.cs:line 520
   at Beamable.Server.BeamableMicroService.<Start>b__49_0[TMicroService](MicroserviceArgs conf) in /Users/chrishanna/Documents/Github/BeamableProduct/microservice/microservice/dbmicroservice/BeamableMicroService.cs:line 163

That exception is coming from this code in the constructor of the DependencyProvider,

foreach (var desc in builder.Descriptors)  
{  
    Descriptors.Add(desc.Interface, desc);  
}

The exception is obvious: the desc.Interface value already exists as a key in the Descriptors dictionary. Where before there were 3 dictionaries for separate lifetimes, now there is only 1. I guess my assumption that it was invalid to register the same interface under multiple lifetimes wasn't quite accurate. Working backwards, this is happening in the Fork operation in the Microservice base code, where the sub-scope adds a scoped version of the _socketRequesterContext.

_args = args.Copy(conf =>
{
    conf.ServiceScope = conf.ServiceScope.Fork(builder =>
    {
        // do we need instance specific services? They'd go here.
        builder.AddScoped(_socketRequesterContext);
        builder.AddScoped(_socketRequesterContext.Daemon);
    });
});

But if I look backwards at the conf.ServiceScope, there is a singleton version already in existence:

.AddSingleton<SocketRequesterContext>(_ => Instances[0].SocketContext)

The way that dependencies used to get resolved was in a specific ordering of the lifetime types: first transient, then scoped, then singleton. The shorter-lived the registration, the higher its resolution priority. In this case, then, the addition of the scoped service would steer resolution away from the original singleton. To fix this, I allowed the constructor to override a service if the new registration has a "higher order" (shorter-lived) lifetime.


foreach (var desc in builder.Descriptors)
{
    if (Descriptors.TryGetValue(desc.Interface, out var existingDesc))
    {
        // a new descriptor may only override an existing one when it has a
        // "higher order" (shorter-lived) lifetime; anything else is an
        // ambiguous registration, so fail loudly
        if (existingDesc.Lifetime <= desc.Lifetime)
        {
            throw new Exception(
                $"Cannot add service=[{existingDesc.Interface.Name}] to scope as lifetime=[{desc.Lifetime}], because the service has already been added to the scope as existing-lifetime=[{existingDesc.Lifetime}]. ");
        }

        Descriptors[desc.Interface] = desc;
    }
    else
    {
        Descriptors.Add(desc.Interface, desc);
    }
}

And now all the tests are passing again. To recap so far, take this benchmark function:

[Benchmark]
public void BaseCase_NoDispose_RegisterAndResolve()
{
    var builder = new DependencyBuilder();
    builder.AddSingleton<TestService>();
    var provider = builder.Build();
    var service = provider.GetService<TestService>();
}

Here are the results... about twice as fast on the runtime, and less than a quarter of the allocation. This test is a pretty vanilla base case; it really isn't doing much at all. It's tragic that we are allocating an entire kilobyte to instantiate a service, but I still believe this cost is worth it as the complexity of the dependency scope scales.

| Method                                      | Mean     | Error     | StdDev  | Gen0   | Gen1   | Allocated |
|-------------------------------------------- |---------:|----------:|--------:|-------:|-------:|----------:|
| BaseCase_NoDispose_RegisterAndResolve (old) | 777.6 ns | 154.21 ns | 8.45 ns | 0.7801 | 0.0114 |   4.78 KB |
| BaseCase_NoDispose_RegisterAndResolve (new) | 341.3 ns |  64.85 ns | 3.55 ns | 0.1683 |      - |   1.03 KB |

So there are more problems. After a dependency scope is created, eventually it needs to get shut down. This happens in our Microservice framework all the time, because there is a dependency scope per request; the scope is disposed at the end of the request.

So here is another benchmark to see how the disposal method performs... Hint: it's not good.

[Benchmark]  
public void BaseCase_Dispose()  
{  
    var builder = new DependencyBuilder();  
    var provider = builder.Build();  
    provider.Dispose();  
}

| Method             | Mean     | Error    | StdDev  | Gen0   | Gen1   | Allocated |
|------------------- |---------:|---------:|--------:|-------:|-------:|----------:|
| BaseCase_NoDispose | 184.9 ns |  8.38 ns | 0.46 ns | 0.0815 |      - |     512 B |
| BaseCase_Dispose   | 451.3 ns | 28.00 ns | 1.53 ns | 0.2332 | 0.0005 |    1464 B |

That Dispose method is doing a lot:

  1. for any service that implements IBeamableDisposable, call its OnDispose method and wait for it to finish via a Promise,
  2. for any service that implements IBeamableDisposableOrder, dispose them in that order,
  3. call Dispose on all child scopes, and
  4. be "thread-safe".

But it feels absurd that it should cost us a kilobyte to call the dispose method.
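Piecing together the fragments discussed below, the method is shaped roughly like this (a reconstruction, not the actual source; the _children list name is a guess):

public async Promise Dispose()
{
    // (3) dispose child scopes first, so services die bottom-up
    var childRemovalPromises = _children.Select(c => c.Dispose()).ToList();
    await Promise.Sequence(childRemovalPromises);

    // (2) group the cached instances by IBeamableDisposableOrder, then
    // (1) await each IBeamableDisposable.OnDispose, group by group
    var groups = InstanceCache.Values.Distinct()
        .GroupBy(x => x is IBeamableDisposableOrder o ? o.DisposeOrder : 0)
        .OrderBy(g => g.Key);
    foreach (var group in groups)
    {
        foreach (var service in group)
        {
            if (service is IBeamableDisposable disposable)
                await disposable.OnDispose();
        }
    }
}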

One of the first things that happens in the method is that all child scopes are disposed, so services die from the bottom of the tree up towards the root. The Dispose method is async, so we need to wait for all the sub-disposals to finish. When I comment this line out, it shaves 200 B off the allocation (and tests break):

await Promise.Sequence(childRemovalPromises);

One thing to note is that the order doesn't actually matter for these sub-disposals to finish, but the implementation of Promise.Sequence does more work than we need in order to maintain order. I wrote a Promise.WhenAll utility that returns a Promise which completes when all of the input Promise<T> values complete. It is a bit better than Promise.Sequence.


[Benchmark]
public async Task Sequence()
{
    var pList = Enumerable.Range(0, 10).Select(_ => new Promise<int>()).ToList();
    var final = Promise.Sequence(pList);

    var _ = pList.Select(p => Task.Run(async () =>
    {
        await Task.Delay(1);
        p.CompleteSuccess(1);
    })).ToList();

    await final;
}

[Benchmark]
public async Task WhenAll()
{
    var pList = Enumerable.Range(0, 10).Select(_ => new Promise<int>()).ToList();
    var final = Promise.WhenAll(pList);

    var _ = pList.Select(p => Task.Run(async () =>
    {
        await Task.Delay(1);
        p.CompleteSuccess(1);
    })).ToList();

    await final;
}

| Method   | Mean     | Error     | StdDev    | Allocated |
|--------- |---------:|----------:|----------:|----------:|
| Sequence | 1.246 ms | 0.2394 ms | 0.0131 ms |   8.22 KB |
| WhenAll  | 1.219 ms | 0.2327 ms | 0.0128 ms |    7.1 KB |

And when applied to the Dispose method, it brings the allocation down to 1296 B (down almost 200 B), but it's an unfair optimization, because in the test there are no child scopes, and the implementation of WhenAll has a zero-input base case:


public static Promise WhenAll<T>(List<Promise<T>> promises)
{
    if (promises.Count == 0)
        return Success;

    var result = new Promise();
    Check();
    return result;

    // walks the list, attaches callbacks to the first incomplete promise,
    // and re-runs when that promise completes, until every promise is done
    void Check()
    {
        for (var i = 0; i < promises.Count; i++)
        {
            if (promises[i].IsCompleted)
                continue;

            promises[i].Error(ex => result.CompleteError(ex));
            promises[i].Then(_ => Check());
            return;
        }
        result.CompleteSuccess();
    }
}

We'll come back to this when we actually have child scopes to think about... Anyway, the next thing that happens in the Dispose method is that all cached service instances in the scope are disposed, with respect to the order defined by the IBeamableDisposableOrder interface. If I take this bit out, the memory allocation goes down to 728 B, which is only a few hundred bytes more than the non-disposal case. So there are lots of potential gains to be made by looking at this disposal logic.

Actually, time for a detour, because the act of calling a Promise function, or awaiting a Promise, causes allocation. A Promise instantiation takes 104 B in a benchmark, and every async Promise method implicitly generates a Promise behind the scenes. I removed a tiny bit of unused cruft, and now it's 96 B, but still. That feels like a lot.

We use Promise so much over the codebase that I feel like even the tiniest wins inside this library will pay dividends later. In fact, take the following benchmark,

[Benchmark]  
public void PromiseAllocation()  
{  
    var p = new Promise();  
}

When this runs, it allocates 96 B... Why? Because inside the Promise, we allocate a new object() to be used as the reference value for the lock statements used throughout the Then/Error methods. According to Microsoft, you should never use the this keyword as the reference value for a lock statement, because someone else might use the object in another lock statement, leading to a deadlock. This is a tricky scenario, because if I change our implementation to lock on this, all of the instantiation-time allocation goes away (allocation will re-appear during usage); a sketch of the trade follows this list. I am of the opinion at the moment that this is a worthy trade, given

  1. how often these Promise types are instantiated, and
  2. how unlikely it is for someone to lock a Promise. I think we can curtail this in documentation.
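Here is that sketch, against a hypothetical minimal Promise rather than the real class:

public class Promise
{
    // before: a dedicated lock object, costing one extra allocation per Promise
    // private readonly object _lock = new object();

    private bool _done;

    public void CompleteSuccess()
    {
        // after: lock on the Promise instance itself; this is only safe as
        // long as callers never write lock(somePromise) in their own code
        lock (this)
        {
            _done = true;
        }
    }
}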

Moving on: our async/await support around Promise also costs memory. This benchmark pair shows that simply adding async to a method signature is enough to trigger an allocation.


[Benchmark]  
public Promise ReturnPromise()  
{  
    var p = new Promise();  
    return null;  
}  

[Benchmark]  
public async Promise ReturnAsyncPromise()  
{  
    var p = new Promise();  
}

The async version costs 96 B more memory.

| Method             | Mean       | Error     | StdDev    | Gen0   | Allocated |
|------------------- |-----------:|----------:|----------:|-------:|----------:|
| ReturnPromise      |  0.0000 ns | 0.0000 ns | 0.0000 ns |      - |         - |
| ReturnAsyncPromise | 43.5974 ns | 1.9522 ns | 0.1070 ns | 0.0153 |      96 B |

That number should look familiar: it is the same 96 B we saw for the Promise allocation (and its lock object) in the previous section. This is a snippet from the async support code for Promise.

public sealed class PromiseAsyncMethodBuilder
{
    private IAsyncStateMachine _stateMachine;
    private Promise _promise = new Promise(); // TODO: allocation.

    public static PromiseAsyncMethodBuilder Create()
    {
        return new PromiseAsyncMethodBuilder();
    }

I took a lot of inspiration (aka, stole) from this article about UniTask. The PromiseAsyncMethodBuilder is instantiated to build the state machine for the async handling. However, it doesn't need to be a class; it can be converted into a struct, which makes its allocation go away.
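A trimmed sketch of the struct version (the real builder also needs Start, SetStateMachine, and the AwaitOnCompleted members, and must be careful about struct copying):

public struct PromiseAsyncMethodBuilder
{
    private Promise _promise;

    // no heap allocation for the builder anymore; it lives on the stack
    // (or inside the state machine) until an await forces a box
    public static PromiseAsyncMethodBuilder Create()
        => new PromiseAsyncMethodBuilder { _promise = new Promise() };

    public Promise Task => _promise;

    public void SetResult() => _promise.CompleteSuccess();
    public void SetException(Exception ex) => _promise.CompleteError(ex);
}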

| Method                      | Mean       | Error     | StdDev    | Gen0    | Allocated |
|---------------------------- |-----------:|----------:|----------:|--------:|----------:|
| ReturnAsyncPromise (prev)   | 43.5974 ns | 1.9522 ns | 0.1070 ns | 0.0153  |      96 B |
| ReturnAsyncPromise (struct) | 35.4175 ns | 3.3815 ns | 0.1853 ns | 0.0102  |      64 B |

The remaining 64 B aren't actually the async machinery's fault; they come from something that happens inside the async state machine, which is that the Promise is completed.

Consider this benchmark, which compares a Promise instantiation to an instantiation plus a CompleteSuccess call. The 64 B looks familiar, eh?

[Benchmark]  
public void PromiseComplete()  
{  
    var p = new Promise();  
    p.CompleteSuccess();  
}

| Method            | Mean       | Error     | StdDev    | Median     | Gen0   | Allocated |
|------------------ |-----------:|----------:|----------:|-----------:|-------:|----------:|
| PromiseComplete   | 33.1745 ns | 0.9919 ns | 0.0544 ns | 33.2017 ns | 0.0102 |      64 B |
| PromiseAllocation |  0.0018 ns | 0.0559 ns | 0.0031 ns |  0.0001 ns |      - |         - |

Where is that 64 B coming from? There are 2 main causes...

  1. the CompleteSuccess method takes a generic <T>, which leads to boxing, and
  2. the base Promise type is secretly a Promise<Unit> under the hood, so passing PromiseBase.Unit to CompleteSuccess is passing the struct by value. We need to pass it by reference to avoid the allocation.

It was at this point I realized I'd been doing my benchmarks wrong this whole time. When returning void from a benchmark, you may have optimized the code so much that the compiler decides nothing actually needs to happen, because there are no side effects. So I went back and changed my Promise allocation benchmark to actually return the generated Promise, and I saw that the allocation itself was responsible for the 64 B, not the CompleteSuccess. This also means the async Promise method was "correct" in that it returned something, whereas the void method was "incorrect". After I fixed my benchmarks, I saw no difference between simply returning a new Promise vs doing it implicitly with an async Promise. I would still like to remove more allocation here, and I think the way to do it would be Promise pooling. However, pooling at this level would introduce new constraints on the behaviour of the Promise library that I don't feel comfortable imposing blind.
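The corrected benchmarks return their results, roughly like so:

[Benchmark]
public Promise PromiseAllocation()
{
    // returning the Promise keeps the allocation observable, so the JIT
    // cannot elide it as dead code
    return new Promise();
}

[Benchmark]
public async Promise ReturnAsyncPromise()
{
    // the async machinery implicitly creates and returns the Promise
}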

| Method             | Mean     | Error     | StdDev    | Gen0   | Allocated |
|------------------- |---------:|----------:|----------:|-------:|----------:|
| PromiseAllocation  | 3.773 ns | 3.0118 ns | 0.1651 ns | 0.0102 |      64 B |
| ReturnAsyncPromise | 5.300 ns | 0.5746 ns | 0.0315 ns | 0.0102 |      64 B |

So it is time to return to the Dispose method's internal call to dispose all of its services... Perhaps an easier target to go after is some LINQ inside the method.

I tore out a lot of code in the method (so it doesn't work) to isolate one specific line:

InstanceCache.Values.Distinct()

| Method            | Mean     | Error    | StdDev  | Gen0   | Allocated |
|------------------ |---------:|---------:|--------:|-------:|----------:|
| WithDistinct      | 234.4 ns | 12.56 ns | 0.69 ns | 0.1185 |     744 B |
| WithoutDistinct   | 303.2 ns | 14.27 ns | 0.78 ns | 0.1044 |     656 B |

Perhaps not surprisingly, there is some allocation here (88 B), and ultimately I don't understand why we need this call to Distinct at all. I remember writing this code, and I remember noticing that it was possible for the same instance to appear in the cache more than once. That was bad, because if we call the Dispose methods on an instance more than once, all the logic gets broken. However, I am sad at Past-Me for not drilling into how the duplicate instances appeared in the list in the first place, because that seems like the real problem. I am going to boldly ignore the root cause and remove the Distinct call, which brings the simple access of InstanceCache.Values down to 680 B from 744 B, saving 64 B.
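(For what it's worth, my best guess, and it is only a guess, is that registering the same instance under two different interfaces would do it; once both are resolved, the same object sits under two cache keys. The types below are hypothetical.)

// speculative illustration only, not a confirmed root cause
var shared = new StorageService();              // implements both interfaces
builder.AddSingleton<IReadStore>(_ => shared);
builder.AddSingleton<IWriteStore>(_ => shared);
// after both are resolved, InstanceCache holds
// { IReadStore -> shared, IWriteStore -> shared },
// so InstanceCache.Values yields the same instance twice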

Moving on: inside the helper method that disposes all the services, the first piece of business is to sort the services into groups based on their disposal ordering.

var clonedList = new List<object>(services);
var groups = clonedList.GroupBy(x =>
{
    if (x is IBeamableDisposableOrder disposableOrder)
        return disposableOrder.DisposeOrder;
    return 0;
});
groups = groups.OrderBy(x => x.Key);

Two things jump out at me,

  1. the clonedList allocates, but it doesn't need to exist, since the LINQ GroupBy is going to re-allocate a new structure anyway. The clonedList existed from a time when I wanted to avoid possible collection modification errors.
  2. the LINQ statements are cute for doing the logic I need, but I'm curious if there is a less memory-intensive way to write this.

Calling this method with everything else commented out, so that it only runs the code above, the allocation for the entire Dispose() method jumps to 1304 B. I may just be shifting complexity around, but I decided to move the sorting of the services out of the Dispose method and track it as additional state as the services are resolved (a sketch follows). This raises the floor on the base case without calling Dispose, but removes the need to do any sorting within Dispose, thus removing the LINQ and dropping the allocation. We should always be calling Dispose, so I'm valuing that case over the non-disposal case.
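A sketch of the precached structure (the field and helper names are mine, not the real ones):

// maintained as services are resolved, instead of sorting inside Dispose()
private readonly SortedDictionary<int, List<object>> _disposeGroups
    = new SortedDictionary<int, List<object>>();

private void TrackForDisposal(object service)
{
    var order = service is IBeamableDisposableOrder o ? o.DisposeOrder : 0;
    if (!_disposeGroups.TryGetValue(order, out var list))
        _disposeGroups[order] = list = new List<object>();
    list.Add(service);
    // Dispose() now just iterates _disposeGroups, which is already in key order
}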

This benchmark is still using broken code, but it illustrates using a precached, sorted data structure instead of the code above.

| Method             | Mean     | Error    | StdDev  | Gen0   | Allocated |
|------------------- |---------:|---------:|--------:|-------:|----------:|
| BaseCase_NoDispose | 197.9 ns |  7.84 ns | 0.43 ns | 0.1006 |     632 B |
| BaseCase_Dispose   | 320.7 ns | 28.95 ns | 1.59 ns | 0.1335 |     840 B |

--more to do

github-actions[bot] commented 2 weeks ago

Lightbeam link