Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License
741 stars 494 forks source link

Hang on calling database from Lazy<T> initialization by multiple threads #1067

Closed lukasz-pyrzyk closed 4 years ago

lukasz-pyrzyk commented 4 years ago

Looks closely related to the #1043

Describe the bug I've migrated two projects from SDK v2 to V3. After migration I noticed that tests are dramatically slower and they hang randomly. They all pass only when parallel execution is disabled.

To Reproduce I've submitted repro to the https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests. It has also branch with SDK V2, which works correctly.

Worth to mention CheckIfDatabaseExists looks like a root cause of the issue. It does an async database call and it's called from the Lazy. We usually use this concept when we want to initialize a single instance across the app and run some provisioning, like upsert of the stored procedures.

        private async Task<Database> CheckIfDatabaseExists(Settings settings)
        {
            var database = _cosmosClient.GetDatabase(settings.DatabaseId);
            var read = await database.ReadStreamAsync().ConfigureAwait(false);
            if (read.StatusCode == HttpStatusCode.NotFound)
            {
                throw new Exception($"CosmosDB database {settings.DatabaseId} does not exist!");
            }

            return database;
        }

During the investigation, I have replaced the async call to the database inside CheckIfDatabaseExists with await Task.Delay(100).ConfigureAwait(false);

        private async Task<Database> CheckIfDatabaseExists(Settings settings)
        {
            var database = _cosmosClient.GetDatabase(settings.DatabaseId);
            await Task.Delay(100).ConfigureAwait(false);
            return database;
        }

with the following change, tests finish successfully, even in parallel.

Environment summary SDK Version: 3.4.1 OS Version: Windows 10

cc @ealsur @j82w

j82w commented 4 years ago

Calling async code from sync code is known to cause deadlocks. Is there any reason you are not using an AsynLazy to avoid sync code?

lukasz-pyrzyk commented 4 years ago

Yes, the project can benefit from using AsyncLazy here. However, I've submitted the issue because it looks like there is some issue with the synchronization context in SDK 3, which is quite difficult to find out and investigate during the migration.

j82w commented 4 years ago

I forked the repository, and the tests are passing for me. I'm using vs2019 with run all tests in parallel. VS shows each test class running in parallel, but only one test at a time in each of the classes. It also seems to be creating a new client and lazy database for each test.

Tests
lukasz-pyrzyk commented 4 years ago

Interesting. I've tested running tests on VS Test explorer, Resharper and dotnet test. In all cases, it hangs without any test passed. U use .NET Core 3.0.100. vs resharper cmd

Did you have a clean master branch?

j82w commented 4 years ago

The only difference is I hard coded the the setting as string instead of using environment variables. What's weird is it seems to be creating a new CosmosClient for every test. If that is the case there shouldn't be any difference in running the tests in parallel since every run has it's own CosmosClient instance. image

lukasz-pyrzyk commented 4 years ago

This is very weird. I've checked with my second laptop and another one from my colleague, it has hanged in both cases. I've updated to 3.5.0 and .NET Core 3.1, results are the same, a hang.

As far as I understand xUnit execution engine, each test should have a different instance of the CosmosDB repository. However, xunit reuses process and for example, static data might be shared between test fixtures. I thought it might be the issue.

I've started to think if it might be related to geo-location or database settings, somehow. I use west-europe region, created with default settings, without any replication.

Would you be interested in some call and talk about it? I can also try to setup a VM on Azure with repro.

j82w commented 4 years ago

This seems very similar to #1043 . @ealsur is working on a fix that will hopefully fix this issue too.

lukasz-pyrzyk commented 4 years ago

Hey @j82w

I've downloaded 3.5.1 and unfortunately, it doesn't fix the issue. Since you wasn't able to reproduce the issue on your machine, I tried to reproduce it insice docker container. Can you try to follow the steps in https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests#3-repro-with-docker and see if you can reproduce the hang?

ealsur commented 4 years ago

Based on how you are using the Repository, shouldn't it be in a Class Fixture instead of the constructor?

ealsur commented 4 years ago

Also see https://github.com/xunit/xunit/issues/864

lukasz-pyrzyk commented 4 years ago

Based on how you are using the Repository, shouldn't it be in a Class Fixture instead of the constructor?

It is a simple repro. In real app Repository comes from Microsoft.AspNetCore.Mvc.Testing and it's managed by WebApplicationFactory<TStartup>.Services.

Also see xunit/xunit#864

I understand that sync-over-async it's not recommended. However, I have a feeling that SDK v3 has a strange behavior. Moving from Lazy<T> to AsyncLazy<T> makes the tests passing, but I'm not sure if this is the correct fix.

Why?

Let's take a loon on the factory method from the Lazy<T> in the Repository class

public Repository(Settings settings)
        {
            _settings = settings ?? throw new ArgumentNullException(nameof(settings));

            _cosmosClient = new CosmosClient(settings.Endpoint, settings.Key);
            _clientFactory = new Lazy<Database>(() =>
            {
                var database = CheckIfDatabaseExists(settings).ConfigureAwait(false).GetAwaiter().GetResult();

                return database;
            });
        }

private async Task<Database> CheckIfDatabaseExists(Settings settings)
        {
            var database = _cosmosClient.GetDatabase(settings.DatabaseId);
            var read = await database.ReadStreamAsync();
            if (read.StatusCode == HttpStatusCode.NotFound)
            {
                throw new Exception($"CosmosDB database {settings.DatabaseId} does not exist!");
            }

            return database;
        }

CheckIfDatabaseExists is synchronized with .ConfigureAwait(false).GetAwaiter().GetResult();, so it's synchronization context shoudn't be an issue. However, when it is called by multiple threads from xUnit it hangs.

The problematic line is var read = await database.ReadStreamAsync(); Let's replace it with any other async operations, like Task.Delay or IO operation with FIle.*Async.

        private async Task<Database> CheckIfDatabaseExists(Settings settings)
        {
            var database = _cosmosClient.GetDatabase(settings.DatabaseId);
            await File.ReadAllTextAsync(@"C:\temp\a.txt");
            await Task.Delay(100);

            return database;
        }

With this change, there is no hang and tests pass successfully. I can also replace it with the same logic code, but written in SDK v2.and it still works, code is in the branch https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests/blob/sdkv2/Tests/Cosmos/Repository.cs

Can you explain to me why await database.ReadStreamAsync() behaves differently than previously mentioned async operations?

I also found it that repro depends on the amount of cores. For example, it does reproduce on i5 with 2 cores, but it works normally with i7 with 4 cores. That is why @j82w` was not able to reproduce it locally. Because of that, I have created a docker container and limited CPU to 1. It also reproduced the issue.

ealsur commented 4 years ago

From the Xunit thread, it seems that Xunit will create a SynchronizationContext with a max parallel threads based on some parameter, possibly the # of cores. SDK 3.5.1 will, if a SynchronizationContext is detected, use Task.Run to wrap the desired operation:

The only difference I see between the scenario we have in our tests (https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/SynchronizationContextTests.cs#L70) and your repro is that the lock is not happening on the SDK call itself, but through the Lazy.

The ReadDatabase operation is basically a call through HttpClient in the end, both the V2 and (now in 3.5.1) wrap the call in Task.Run if a context is detected. Unless there is some further conflict between Xunit's SynchronizationContext behavior and another component (like HttpClient) that might have a difference in V3, I really can't see the source of the problem. V2 by default uses Gateway mode for Connectivity, while V3 uses Direct mode, maybe there is something in the TCP stack that is conflicting with Xunit's SynchronizationContext, you could try to set the V3 Connectivity to Gateway and see how it behaves?

lukasz-pyrzyk commented 4 years ago

OK, I have the issue.

I noticed that ConfigureAwait was missing in the call method calling ReadStreamAsync. Because of that, continuation wasn't able to start inside lazy when more threads were not available.

I have double-checked and I'm happy to say that issue was occurring on SDK 3.4.1, but it's fixed in 3.5.1.

Thanks for help