PersistentShardCoordinator throws Microsoft.Data.SqlClient.SqlException

fscavo commented 2 years ago

Version of Akka.NET? 1.4.26
Which Akka.NET Modules? Akka.Cluster, Akka.Cluster.Sharding

Akka.Cluster.Sharding.PersistentShardCoordinator reports this error:

One or more errors occurred. (Violation of UNIQUE KEY constraint 'UQ_AkkaPersistenceEventJournal'. Cannot insert duplicate key in object 'dbo.AkkaPersistenceEventJournal'. The duplicate key value is (/system/sharding/herdCoordinator/singleton/coordinator, 32)

and throws the following exception:

Microsoft.Data.SqlClient.SqlException:
   at Microsoft.Data.SqlClient.SqlConnection.OnError (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.TdsParser.TryRun (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.CompleteAsyncExecuteReader (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=2.0.20168.4, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Akka.Persistence.Sql.Common.Journal.BatchingSqlJournal`2+d85.MoveNext (Akka.Persistence.Sql.Common, Version=1.4.21.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Akka.Persistence.Sql.Common.Journal.BatchingSqlJournal`2+d77.MoveNext (Akka.Persistence.Sql.Common, Version=1.4.21.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Akka.Util.Internal.AtomicState+d7`1.MoveNext (Akka, Version=1.4.21.0, Culture=neutral, PublicKeyToken=null)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Akka.Util.Internal.AtomicState+d7`1.MoveNext (Akka, Version=1.4.21.0, Culture=neutral, PublicKeyToken=null)

Aaronontheweb commented 2 years ago

You might need to run https://github.com/petabridge/Akka.Cluster.Sharding.RepairTool here. This is likely caused by an interrupted shutdown of the Akka.Cluster.Sharding coordinator.

fscavo commented 2 years ago

Is this a bug, or is there something we are not handling properly? @Aaronontheweb

Aaronontheweb commented 2 years ago

It's the latter: you need to allow clean shutdowns of your cluster nodes to avoid this problem. Or you can change your state storage mode to DData, which doesn't have this issue.
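
For reference, the DData switch mentioned above is a one-line HOCON change. This is a minimal sketch, not a complete migration recipe. It assumes the cluster currently runs with `state-store-mode = persistence` (the 1.4 default, as far as I know), which is the mode that writes coordinator state to the SQL journal named in the error.

```hocon
akka.cluster.sharding {
  # Keep shard coordinator state in Akka.DistributedData instead of the
  # Akka.Persistence journal, so the coordinator no longer writes events
  # under a persistence id like /system/sharding/<type>Coordinator/...
  state-store-mode = ddata
}
```

Two caveats, both assumptions to verify against the docs for your version: every node should run with the same mode, so this is probably not something to roll out node by node; and with remember-entities on, DData mode in 1.4 stores the entity list in a durable DData store, which needs its own setup that is not shown here.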

Aaronontheweb commented 2 years ago

That being said, we ought to make this a better experience for users.

to11mtm commented 2 years ago

@Aaronontheweb What might help are guidelines in the docs for coordinated-shutdown timeouts as well as sharding timeouts.

I've found that with Sharding (especially with remember-entities=on), the closer you are to running at 'max load' for a cluster, the longer it takes to migrate shards on shutdown. For example, if you are moving hundreds or thousands of actors and dozens of shards across multiple types of sharded actors, it may not hurt to shoot for a total coordinated shutdown timeout of a minute or more, as well as longer coordinator timeouts on sharding. Don't forget to consider load shifts -during- a deploy. E.g. if you're rolling across 4 nodes, how likely is it that Node 1 has had anything moved to it before you start shutting Node 2 down? (Not very!)
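
As a rough illustration of the timeouts discussed above, the HOCON below lengthens the coordinated-shutdown phases that sharding and the coordinator singleton take part in, plus two sharding timeouts. All values are illustrative assumptions, not recommendations from the Akka.NET team; the setting names are the ones I believe exist in the 1.4 reference configuration, so check them against your version. Note that there is no single "total" shutdown timeout: the total is the sum of the phase timeouts.

```hocon
akka {
  coordinated-shutdown {
    phases {
      # Phase in which shard regions hand their shards off before the node
      # leaves. Heavily loaded nodes may need far longer than the default.
      cluster-sharding-shutdown-region {
        timeout = 60s   # illustrative value
      }
      # Phase in which cluster singletons (including the shard coordinator)
      # are handed over to another node.
      cluster-exiting {
        timeout = 30s   # illustrative value
      }
    }
  }

  cluster.sharding {
    # Upper bound on how long a shard hand-off may take.
    handoff-timeout = 60s         # illustrative value
    # How long the coordinator waits for a state update to be stored.
    updating-state-timeout = 10s  # illustrative value
  }
}
```

Whatever values you pick, the process host also has to wait that long before killing the node (for example the container's termination grace period), otherwise the longer phase timeouts never get a chance to run.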

I do know that with the Persistence.Linq2Db plugin, shutdown performance is improved overall (especially if multiple shards are shutting down). It's definitely better in most 'overload' scenarios (i.e. when you ignore the advice that 10 shards per node is a good max).

Aaronontheweb commented 9 months ago

I think the changes we introduced in Akka.NET v1.5 probably resolve this.

akkadotnet / akka.net

PersistentShardCoordinator throws Microsoft.Data.SqlClient.SqlException #5389