Closed lukasz-pyrzyk closed 4 years ago
Calling async code from sync code is known to cause deadlocks. Is there any reason you are not using an AsynLazy to avoid sync code?
Yes, the project can benefit from using AsyncLazy here. However, I've submitted the issue because it looks like there is some issue with the synchronization context in SDK 3, which is quite difficult to find out and investigate during the migration.
I forked the repository, and the tests are passing for me. I'm using vs2019 with run all tests in parallel. VS shows each test class running in parallel, but only one test at a time in each of the classes. It also seems to be creating a new client and lazy database for each test.
Interesting. I've tested running tests on VS Test explorer, Resharper and dotnet test
. In all cases, it hangs without any test passed. U use .NET Core 3.0.100
.
Did you have a clean master branch?
The only difference is I hard coded the the setting as string instead of using environment variables. What's weird is it seems to be creating a new CosmosClient for every test. If that is the case there shouldn't be any difference in running the tests in parallel since every run has it's own CosmosClient instance.
This is very weird. I've checked with my second laptop and another one from my colleague, it has hanged in both cases. I've updated to 3.5.0 and .NET Core 3.1, results are the same, a hang.
As far as I understand xUnit execution engine, each test should have a different instance of the CosmosDB repository. However, xunit reuses process and for example, static data might be shared between test fixtures. I thought it might be the issue.
I've started to think if it might be related to geo-location or database settings, somehow. I use west-europe region, created with default settings, without any replication.
Would you be interested in some call and talk about it? I can also try to setup a VM on Azure with repro.
This seems very similar to #1043 . @ealsur is working on a fix that will hopefully fix this issue too.
Hey @j82w
I've downloaded 3.5.1 and unfortunately, it doesn't fix the issue. Since you wasn't able to reproduce the issue on your machine, I tried to reproduce it insice docker container. Can you try to follow the steps in https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests#3-repro-with-docker and see if you can reproduce the hang?
Based on how you are using the Repository, shouldn't it be in a Class Fixture instead of the constructor?
Based on how you are using the Repository, shouldn't it be in a Class Fixture instead of the constructor?
It is a simple repro. In real app Repository
comes from Microsoft.AspNetCore.Mvc.Testing
and it's managed by WebApplicationFactory<TStartup>.Services
.
Also see xunit/xunit#864
I understand that sync-over-async it's not recommended. However, I have a feeling that SDK v3
has a strange behavior. Moving from Lazy<T>
to AsyncLazy<T>
makes the tests passing, but I'm not sure if this is the correct fix.
Why?
Let's take a loon on the factory method from the Lazy<T>
in the Repository
class
public Repository(Settings settings)
{
_settings = settings ?? throw new ArgumentNullException(nameof(settings));
_cosmosClient = new CosmosClient(settings.Endpoint, settings.Key);
_clientFactory = new Lazy<Database>(() =>
{
var database = CheckIfDatabaseExists(settings).ConfigureAwait(false).GetAwaiter().GetResult();
return database;
});
}
private async Task<Database> CheckIfDatabaseExists(Settings settings)
{
var database = _cosmosClient.GetDatabase(settings.DatabaseId);
var read = await database.ReadStreamAsync();
if (read.StatusCode == HttpStatusCode.NotFound)
{
throw new Exception($"CosmosDB database {settings.DatabaseId} does not exist!");
}
return database;
}
CheckIfDatabaseExists
is synchronized with .ConfigureAwait(false).GetAwaiter().GetResult();
, so it's synchronization context shoudn't be an issue. However, when it is called by multiple threads from xUnit it hangs.
The problematic line is var read = await database.ReadStreamAsync();
Let's replace it with any other async operations, like Task.Delay
or IO operation with FIle.*Async
.
private async Task<Database> CheckIfDatabaseExists(Settings settings)
{
var database = _cosmosClient.GetDatabase(settings.DatabaseId);
await File.ReadAllTextAsync(@"C:\temp\a.txt");
await Task.Delay(100);
return database;
}
With this change, there is no hang and tests pass successfully. I can also replace it with the same logic code, but written in SDK v2.and it still works, code is in the branch https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests/blob/sdkv2/Tests/Cosmos/Repository.cs
Can you explain to me why await database.ReadStreamAsync()
behaves differently than previously mentioned async operations?
I also found it that repro depends on the amount of cores. For example, it does reproduce on i5 with 2 cores, but it works normally with i7 with 4 cores. That is why @j82w` was not able to reproduce it locally. Because of that, I have created a docker container and limited CPU to 1. It also reproduced the issue.
From the Xunit thread, it seems that Xunit will create a SynchronizationContext with a max parallel threads based on some parameter, possibly the # of cores. SDK 3.5.1 will, if a SynchronizationContext is detected, use Task.Run
to wrap the desired operation:
The only difference I see between the scenario we have in our tests (https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/SynchronizationContextTests.cs#L70) and your repro is that the lock is not happening on the SDK call itself, but through the Lazy.
The ReadDatabase operation is basically a call through HttpClient in the end, both the V2 and (now in 3.5.1) wrap the call in Task.Run
if a context is detected. Unless there is some further conflict between Xunit's SynchronizationContext behavior and another component (like HttpClient) that might have a difference in V3, I really can't see the source of the problem. V2 by default uses Gateway mode for Connectivity, while V3 uses Direct mode, maybe there is something in the TCP stack that is conflicting with Xunit's SynchronizationContext, you could try to set the V3 Connectivity to Gateway and see how it behaves?
OK, I have the issue.
I noticed that ConfigureAwait
was missing in the call method calling ReadStreamAsync
. Because of that, continuation wasn't able to start inside lazy
I have double-checked and I'm happy to say that issue was occurring on SDK 3.4.1, but it's fixed in 3.5.1.
Thanks for help
Looks closely related to the #1043
Describe the bug I've migrated two projects from SDK v2 to V3. After migration I noticed that tests are dramatically slower and they hang randomly. They all pass only when parallel execution is disabled.
To Reproduce I've submitted repro to the https://github.com/lukasz-pyrzyk/CosmosDbHangWithParallelTests. It has also branch with SDK V2, which works correctly.
Worth to mention. We usually use this concept when we want to initialize a single instance across the app and run some provisioning, like upsert of the stored procedures.
CheckIfDatabaseExists
looks like a root cause of the issue. It does an async database call and it's called from the LazyDuring the investigation, I have replaced the async call to the database inside
CheckIfDatabaseExists
withawait Task.Delay(100).ConfigureAwait(false);
with the following change, tests finish successfully, even in parallel.
Environment summary SDK Version: 3.4.1 OS Version: Windows 10
cc @ealsur @j82w