akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Exception in PersistentShardCoordinator ReceiveRecover #3414

Closed joshgarnett closed 5 years ago

joshgarnett commented 6 years ago

Akka 1.3.5

This morning, while making some provisioning changes, we ended up in a state where two single-node clusters were running against the same database. After fixing the error and starting only a single node, the underlying Akka code failed to recover.

2018-04-22 15:28:45.148 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [40650] for persistenceId [/system/sharding/zCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/z#673731278] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:28:45.438 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [50633] for persistenceId [/system/sharding/oCoordinator/singleton/coordinator]
System.ArgumentException: Shard 78 is already allocated
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:07.714 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11774] for persistenceId [/system/sharding/pCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/p#1560991559] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:10.192 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [3087] for persistenceId [/system/sharding/wCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/w#43619595] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

2018-04-22 15:29:15.210 ERROR PersistentShardCoordinator Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [11109] for persistenceId [/system/sharding/eCoordinator/singleton/coordinator]
System.ArgumentException: Region [akka://AkkaCluster/system/sharding/e#1963167556] not registered
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Persistence.Eventsourced.<>c__DisplayClass92_0.<Recovering>b__0(Receive receive, Object message)

My expectation is that when two nodes attempt to own the same data, one of them would eventually hit a journal write error (since the journal sequence number would not be unique) and the ActorSystem would then shut itself down. On recovery it should always be possible to get back into a consistent state.
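For reference, the uniqueness guarantee described here is what most SQL journal plugins enforce at the schema level. The DDL below is an illustrative sketch modeled loosely on the SQL Server journal schema; exact table and column names vary by plugin and version:

```sql
-- Sketch: the composite primary key is what makes a second writer's INSERT
-- fail when it reuses a (PersistenceId, SequenceNr) pair already taken by
-- another node writing to the same journal.
CREATE TABLE EventJournal (
    PersistenceId NVARCHAR(255)  NOT NULL,
    SequenceNr    BIGINT         NOT NULL,
    Payload       VARBINARY(MAX) NOT NULL,
    -- ...other columns omitted...
    CONSTRAINT PK_EventJournal PRIMARY KEY (PersistenceId, SequenceNr)
);
```

With this constraint in place, a split-brain scenario produces a constraint violation on one of the writers rather than silently interleaved events.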

In our case this was caused by user error, but it could easily occur during a network partition where two nodes claim ownership of the same underlying dataset.

Aaronontheweb commented 6 years ago

Going to look into this while I'm at it with https://github.com/akkadotnet/akka.net/issues/3455

Aaronontheweb commented 5 years ago

The issue is that the PersistentShardCoordinator doesn’t save or recover its state in the correct order - the ShardHomeAllocated messages that are tripping the exception during recovery should only be persisted after a ShardRegionRegistered message (shards belong to shard regions.)
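For context on why replay blows up: recovery applies the journaled events to an in-memory state, and a ShardHomeAllocated event is only valid once its owning region has been registered. The following is a minimal, self-contained sketch of that invariant (illustrative only, not the actual Akka.NET implementation; type and member names are simplified):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the ordering invariant PersistentShardCoordinator.State.Updated
// enforces during replay. The two ArgumentExceptions mirror the ones seen
// in the stack traces above.
class CoordinatorStateSketch
{
    private readonly HashSet<string> _regions = new HashSet<string>();
    private readonly Dictionary<string, string> _shards = new Dictionary<string, string>();

    // Applied for a ShardRegionRegistered event.
    public void RegisterRegion(string region) => _regions.Add(region);

    // Applied for a ShardHomeAllocated event.
    public void AllocateShard(string shard, string region)
    {
        if (!_regions.Contains(region))
            throw new ArgumentException($"Region [{region}] not registered", "e");
        if (_shards.ContainsKey(shard))
            throw new ArgumentException($"Shard {shard} is already allocated", "e");
        _shards[shard] = region;
    }
}
```

In a healthy journal, ShardRegionRegistered always precedes the ShardHomeAllocated events for that region; the errors reported above mean the replayed stream contained allocations for regions the recovered state had never seen.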

Three possible causes of this:

  1. There’s a fair bit of async code inside the PersistentShardCoordinator I’m still untangling - it’s possible a race condition could cause this if the actor were trying to Persist its events out of order. I’m still looking into that possibility. Technically, the actor should never even be asked to host a shard until its region gets created first. I doubt this is the issue, but I can’t rule it out 100%.
  2. I’m wondering whether the data we write to Akka.Persistence when we save the sharding snapshot is accurate. I took a look through the serialization code and I’m a little suspicious that it may not be persisting the region state correctly. That would also cause this issue: the data was saved, but not in the correct format.
  3. The last possibility is the Akka.Persistence implementation itself: if the journal isn’t saving or replaying events for the PersistentShardCoordinator in the correct order, that would certainly cause this.

Going to eliminate number 2 first since that's the simplest - will look into the others next.

Aaronontheweb commented 5 years ago

Manually verified the output of this spec:

https://github.com/akkadotnet/akka.net/blob/e15b935b59e739220f28cb197bb191211e912bd4/src/contrib/cluster/Akka.Cluster.Sharding.Tests/ClusterShardingMessageSerializerSpec.cs#L54-L78

Can vouch for its accuracy - the sharding serializer appears to be working correctly.

Aaronontheweb commented 5 years ago

This is probably the issue https://github.com/akkadotnet/akka.net/issues/3204

Going to create some reproduction specs and then see where things go.

Aaronontheweb commented 5 years ago

Working on a fun reproduction of this using an actual integration test against SQL Server spun up via docker-compose https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro

izavala commented 5 years ago

I was able to reproduce this issue on my end with my copy of the above project: https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro.

I received the same failure-to-recover error message:

[ERROR][03/18/2019 23:34:59][Thread 0003][[akka://ShardFight/system/sharding/fubersCoordinator/singleton/coordinator#329445421]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [113] for persistenceId [/system/sharding/fubersCoordinator/singleton/coordinator]
Cause: System.ArgumentException: Shard 23 is already allocated
Parameter name: e
   at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
   at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)

I've attached the data from the EventJournal table in the hope that it reveals more about what is causing this behavior. Journal.zip

Aaronontheweb commented 5 years ago

@izavala I'll deserialize the data gathered from the repro here and see what's up - that should paint a clearer picture of what's going on.

Aaronontheweb commented 5 years ago

Wrote a custom tool using Akka.Persistence.Query using the dataset that created this error: https://github.com/Aaronontheweb/Cluster.Sharding.Viewer

Attached is the output. Haven't analyzed it yet, but this is the same data from @izavala's reproduction.

Shard-replay-crash-data.log

Aaronontheweb commented 5 years ago

Worth noting in these logs: no snapshots were ever saved for the PersistentShardCoordinator during this run - it logged only 30-34 "ShardHomeAllocated" messages per run of our reproduction app, while the default Akka.Cluster.Sharding settings only take a snapshot once every 1000 journaled entries.
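For reference, the snapshot interval mentioned here is controlled by a single HOCON setting, shown below at its default value (so these short runs never came close to triggering a snapshot):

```hocon
akka.cluster.sharding {
  # number of persisted coordinator events between snapshots (default: 1000)
  snapshot-after = 1000
}
```

Lowering this value in a reproduction app would force snapshots to participate in recovery much sooner.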

Aaronontheweb commented 5 years ago

So the logs we've produced confirm that #3204 is the issue - the exception in recovery only occurs when it's the same node with the same address trying to deserialize its own RemoteActorRefs each time. The issue doesn't occur when the node reboots using a new hostname when we tear down our Docker cluster and recreate it.

Aaronontheweb commented 5 years ago

Moving this to 1.4.0 - changes are too big to put into a point release. We're going to need to make a lot of changes to the serialization system for IActorRefs to complete this.

heatonmatthew commented 5 years ago

Hey cool, I've run into this one too. I'm still in a prototype phase, but this was on my list of issues to address before moving toward production.

@Aaronontheweb Since you're making serialization system changes, just a heads up that with your netstandard2.0 update in #3668 the difference between Framework and Core disappears. See my commit referencing the issue for the code that removes the difference.

Aaronontheweb commented 5 years ago

I've been able to verify via https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro/pull/10 that https://github.com/akkadotnet/akka.net/pull/3744 resolves this issue. I'm not done with #3744 yet - I still need to make sure this works with serialize-messages and so on, but we're getting there.

Aaronontheweb commented 5 years ago

This is now resolved as of #3744

Caldas commented 3 years ago

Hey guys, since this issue has been fixed, I recommend updating the README at https://github.com/petabridge/akkadotnet-cluster-workshop, since the end of it still points to this issue as an active one.