dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Deadlock using AdoNetStorageProvider in OrleansSqlUtils 1.4.0 #2745

Closed: Daemon6666 closed this issue 7 years ago

Daemon6666 commented 7 years ago

I get a deadlock using AdoNetStorageProvider when the Silo starts. The code in question is in RelationalStorage, in the ReadAsync method:

```csharp
var ret = ExecuteAsync(query, parameterProvider, ExecuteReaderAsync, selector, cancellationToken, commandBehavior).GetAwaiter().GetResult().Item1;
```

I use a PostgreSQL database. Maybe it occurs because the ActivationTaskScheduler is used.

veikkoeeva commented 7 years ago

@Daemon6666 PostgreSQL does not have a storage provider yet. If you are indeed using a storage provider, that is most likely the cause of your problem.

On a further note, it's unclear to me why it would deadlock instead of throwing an error. It should throw here. It uses .ConfigureAwait(false), and at that point the Task has already been awaited. There might be a case where the PostgreSQL provider runs on the same thread as the caller (see the small fix for MySQL in the RelationalStorage code), but I believe that shouldn't have a bearing in this case.
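For readers less familiar with the pattern being discussed: sync-over-async deadlocks happen when the blocked thread is the very thread the awaited continuation needs to resume on. A minimal sketch (not Orleans code; the single-threaded scheduler here stands in for something like the ActivationTaskScheduler mentioned above):

```csharp
// Sketch: why blocking on a task can deadlock on a single-threaded context.
async Task<int> GetValueAsync()
{
    await Task.Delay(100); // without ConfigureAwait(false), the continuation is
                           // posted back to the captured context/scheduler
    return 42;
}

// If this runs on a context that has only one thread (a UI context, or a
// single-threaded TaskScheduler), GetResult() blocks that thread. The
// continuation after Task.Delay then has no thread to run on, so the task
// never completes and GetResult() never returns: a deadlock.
int value = GetValueAsync().GetAwaiter().GetResult();

// ConfigureAwait(false) inside the awaited method avoids capturing the context,
// which is why the library call discussed above should not deadlock, unless a
// call deeper down (e.g. DbConnection.OpenAsync) captures the context anyway.
```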

Daemon6666 commented 7 years ago

I implemented PostgreSQL scripts for grain state storage. They let me use AdoNetStorageProvider (invariant=npgsql).

veikkoeeva commented 7 years ago

@Daemon6666 It's difficult to tell what the reason could be. The storage providers for SQL Server and MySQL appear to be functional. PostgreSQL has a membership provider, which is functional, so one could conclude that the PostgreSQL ADO.NET drivers work OK. Have you tried the PostgreSQL membership provider to see if it works for you?

Is the deadlock you refer to coming from the database, manifesting when the result is referenced?

Daemon6666 commented 7 years ago

Why did you use ExecuteAsync(...).GetAwaiter().GetResult().Item1 on an async method instead of await ExecuteAsync(...)? The code, with its comment, reads:

```csharp
// It is certain the result is already ready here and can be collected straight away.
// Having a truly asynchronous result collection, IAsyncEnumerable<TResult> or an
// equivalent method ought to be used. Taking the result here without async-await
// saves generating the async state machine.
var ret = ExecuteAsync(query, parameterProvider, ExecuteReaderAsync, selector, cancellationToken, commandBehavior).GetAwaiter().GetResult().Item1;
```

It seems that because of the use of GetAwaiter().GetResult() we get a deadlock while executing OpenAsync on the connection. The fix for MySQL is not suitable for us for the same reason: it only handles the synchronous executor, not the OpenAsync part.

```csharp
using (var connection = DbConnectionFactory.CreateConnection(invariantName, connectionString))
{
    await connection.OpenAsync(cancellationToken).ConfigureAwait(false);
    // ...
}
```

veikkoeeva commented 7 years ago

@Daemon6666 So your code deadlocks when opening the connection and not when selecting the result? The opening of the connection happens here.

The reason for that, as stated in the comment, has been to save an asynchronous state machine invocation. The results are collected here, into a list. Then they're passed back using Task.FromResult(ret); this is to satisfy the asynchronous function signatures in the calls. The reasoning for this can also be seen in this SO answer.

What to do: I'm afraid I can't glean the exact reason for the deadlock from the information you've provided, but you can change the code to use await and see if it helps. Run the tests, and if it works, please submit a PR.
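As a sketch of that suggested change (assuming the enclosing method is converted to async and its return type adjusted accordingly; the names are the ones from the RelationalStorage code quoted above):

```csharp
// Before: blocking on the task; a continuation that needs the caller's
// thread or scheduler can deadlock here.
// var ret = ExecuteAsync(query, parameterProvider, ExecuteReaderAsync, selector,
//                        cancellationToken, commandBehavior).GetAwaiter().GetResult().Item1;

// After: awaiting frees the calling thread while the query runs. This requires
// the enclosing method to be async and to return Task<TResult> instead of TResult.
var result = await ExecuteAsync(query, parameterProvider, ExecuteReaderAsync, selector,
                                cancellationToken, commandBehavior).ConfigureAwait(false);
var ret = result.Item1;
```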

Going a bit further with this code, in newer .NET, it'd be prudent to use IAsyncEnumerable and ValueTask.
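A hedged sketch of what that could look like: IAsyncEnumerable&lt;T&gt; (available from .NET Core 3.0 / C# 8) lets rows stream out as they are read, instead of being buffered into a list and wrapped in Task.FromResult. The method and member names below are illustrative, not the actual Orleans API:

```csharp
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical streaming read: no intermediate list, no Task.FromResult.
public async IAsyncEnumerable<TResult> ReadStreamingAsync<TResult>(
    string query,
    Func<DbDataReader, TResult> selector,
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    using var connection = DbConnectionFactory.CreateConnection(invariantName, connectionString);
    await connection.OpenAsync(cancellationToken).ConfigureAwait(false);

    using var command = connection.CreateCommand();
    command.CommandText = query;

    using var reader = await command.ExecuteReaderAsync(cancellationToken).ConfigureAwait(false);
    while (await reader.ReadAsync(cancellationToken).ConfigureAwait(false))
    {
        yield return selector(reader); // each row is yielded as soon as it is read
    }
}
```

ValueTask&lt;T&gt; would similarly avoid a Task allocation on hot paths where the result is frequently available synchronously.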

Maarten88 commented 7 years ago

I also sometimes see these deadlocks with SQL Server. It seems to occur mostly during development after I have recreated the storage database, when the Storage table has very few records (my grain project does database creation/migration on startup, including grain storage tables and data).

The error seems to get less frequent when the Storage table grows and the system has been running for some time.

The usage pattern of my code is that it always writes a couple of grains of the same class together at the same time. The error I get is:

```
Error from storage provider during WriteState for grain Type=[namespace].Grains.BotDataStoreGrain Pk=*grn/505BE4EA/0000000000000000000000000000000006000000505be4ea+user:56800324:emulator:2c1c7fa3-0x6D48FE94 Id=GrainReference:*grn/505BE4EA/00000000+user:56800324:emulator:2c1c7fa3 Error=

Exc level 0: System.Data.SqlClient.SqlException: Transaction (Process ID 56) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at System.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__174_0(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.SqlUtils.RelationalStorage.<ExecuteReaderAsync>d__13`1.MoveNext()
```

I'm using the latest Orleans version, 1.4.1. I had "Asynchronous Processing=true;" in the connection string, but removing it made no difference.

veikkoeeva commented 7 years ago

@Maarten88 It doesn't look like your problem is the same as the one originally reported in this thread; the original problem involves custom database scripts.

What you describe is something that can happen. The reason, if you look at how the insertion is done, is that the INSERT takes heavy locks to ensure no duplicates will be inserted. The problem is alleviated the more records there are (two might be sufficient), since the locks can then use an index, and one part of the index is the ID.

The duplicate detection here is the problematic part. Orleans can in some circumstances try to activate the same grain (same type and ID) twice, which could lead to the state being inserted twice with different values. It's not clear then what should happen, i.e. how the grains will deactivate and re-read the state, and if so, which one of the states to take. Preventing this situation altogether leads to the heavy locking you see there (the other duplicate grain simply fails to write its state).

This is exacerbated to some extent by the fact that if the grain type and ID were used directly as the unique key, it would slow down the operations and make the index much heavier and more difficult to maintain, as you can imagine from adding an NVARCHAR(n) to an index, twice. Instead the solution has been to take the hash of them both and use that as the index.

What to do:

1. When dealing with storage, errors are a possibility, so prepare for them by retrying, and if that doesn't help, have some other action to take.
2. Consider what the edge cases here are and whether it would be OK to have multiple states with the same type and only take the latest. I'm not entirely sure whether this could lead to a situation where two grains are activated, one is deactivated, but the one left running holds an ETag it can't use. Maybe it's a benign race and re-reading the state will help.
3. Try to lighten the INSERT part of the query. As you see, though, the problem goes away when the table grows, as the INSERT can effectively check the locking need from the index, which basically means locking something like one row, if even that.
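Point 1 can be sketched as a simple retry loop. SQL Server reports a deadlock victim as error number 1205, which matches the exception shown above; the delay and attempt counts here are illustrative, and grainStateWriter stands in for whatever performs the actual write:

```csharp
// Sketch: retry a state write when it is chosen as the deadlock victim.
const int DeadlockErrorNumber = 1205; // SQL Server deadlock victim error
const int MaxAttempts = 3;

for (int attempt = 1; ; attempt++)
{
    try
    {
        await grainStateWriter().ConfigureAwait(false); // hypothetical write delegate
        break;
    }
    catch (SqlException ex) when (ex.Number == DeadlockErrorNumber && attempt < MaxAttempts)
    {
        // Back off briefly and rerun the transaction, as the error message suggests.
        await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt)).ConfigureAwait(false);
    }
}
```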

We can also chat more in Gitter, just ping me.

jdom commented 7 years ago

Indeed, this was deadlocking a lot with the SQL Server tests too (see #2804, #2790 and finally #2860 on INSERTs). Just a heads up in case you didn't see those, since this test was fixed (I'd say relaxed so it doesn't fail often, although the implementation can still deadlock if it gets tons of inserts in parallel) after this issue was opened. This is the first time I see this issue, sorry for not chiming in earlier.

BTW, did you consider contributing the PostgreSQL scripts for the storage providers? It might help you catch issues by opening them up to the community. Just a thought.

veikkoeeva commented 7 years ago

I should also add to the

> adding NVARCHAR(n) to an index, twice. Instead the solution has been to take the hash of them both and use them as the index.

part that the index isn't unique for two reasons: duplicate activation, and the fact that the index is narrowed using hashing. If the NVARCHAR(n) parts were used in the index, one could fail a duplicate write with a primary key violation and avoid some locking, though the index would be a very heavy one and would limit the ID and type length of the grains to the maximum allowed by the DB indexing system. Now that the type and ID are hashed, there's a chance of collisions, so the values can't be unique, and hence primary key violations can't be relied on as the duplicate prevention mechanism.
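To illustrate why hashing rules out a unique index: two distinct (grain type, grain ID) pairs can produce the same hash, so a primary key violation can no longer distinguish a true duplicate from a collision. The hash function below is purely illustrative, not the one Orleans uses:

```csharp
// Illustrative narrow key: one fixed-size integer instead of two NVARCHAR(n) columns.
static int HashKey(string grainType, string grainId)
{
    unchecked
    {
        int h = 17;
        foreach (char c in grainType) h = h * 31 + c;
        foreach (char c in grainId) h = h * 31 + c;
        return h;
    }
}

// With only 2^32 possible hash values, distinct grains can collide. The index
// can therefore only narrow the search: the query must still compare the full
// type and ID strings after the index seek, and uniqueness cannot be enforced
// by the database on the hash column alone.
```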

It's also not feasible to let the DB generate the primary keys (as in adding a surrogate identity column), since that would mean Orleans would have to keep those keys in memory and assign them on the fly to the states. It's doable and would allow narrower indexes and less locking, but it would make sharding both horizontally and vertically (and heterogeneously) difficult, perhaps even impossible. Sharding is something I think should be done by just adding a tuple of connection strings and mapping them to {deployment ID, grain type, grain ID} tuples with some rule.

Instead, the idea for the queries is that SELECT and UPDATE dominate, in which case there's basically no locking and the queries hit the right row rather fast, as if the table were a Dictionary&lt;Tuple&lt;deploymentId, grainType, grainId&gt;, TState&gt;. Even the heavy initial INSERTs cease to be a problem (or should, at least) once the table grows a bit and the locks can use the index. Note that the above doesn't mean the locking couldn't be made lighter, but that would require careful planning.

The idea with that indexing, and the table, is that they're easy to maintain (perhaps no maintenance needed at all?) even when there are a lot of rows, as in billions, and should function just as fast then too. Disaster recovery times are maybe the limiting factor, but one takes care of that with compression, file groups and things like that. The system should be simple enough to allow any plan.

This is written elsewhere too, but maybe it doesn't hurt to add this information here too.