Closed: joshgarnett closed this issue 5 years ago
Going to look into this while I'm at it with https://github.com/akkadotnet/akka.net/issues/3455
The issue is that the PersistentShardCoordinator doesn't save or recover its state in the correct order - the ShardHomeAllocated messages that are tripping the exception during recovery should only be persisted after a ShardRegionRegistered message (shards belong to shard regions.)
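As a sketch, the invariant being violated can be modeled like this. This is a hypothetical, simplified model for illustration only - `CoordinatorState`, `ShardRegionRegistered`, and `ShardHomeAllocated` here are stand-ins, not the actual Akka.Cluster.Sharding types:

```csharp
using System;
using System.Collections.Generic;

// Toy model of the coordinator-state invariant: a shard may only be
// allocated once, and only to a region that has already registered.
class CoordinatorState
{
    readonly HashSet<string> _regions = new HashSet<string>();
    readonly Dictionary<string, string> _shards = new Dictionary<string, string>();

    public void ShardRegionRegistered(string region) => _regions.Add(region);

    public void ShardHomeAllocated(string shard, string region)
    {
        if (!_regions.Contains(region))
            throw new ArgumentException($"Region {region} not registered", "e");
        if (_shards.ContainsKey(shard))
            // Replaying the same allocation twice trips this - the same
            // "Shard X is already allocated" error seen during recovery.
            throw new ArgumentException($"Shard {shard} is already allocated", "e");
        _shards[shard] = region;
    }
}

class Program
{
    static void Main()
    {
        var state = new CoordinatorState();
        state.ShardRegionRegistered("region-1");
        state.ShardHomeAllocated("23", "region-1");
        try
        {
            // Simulate a journal replaying the same event a second time.
            state.ShardHomeAllocated("23", "region-1");
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

If the events are journaled (and replayed) in the right order, recovery is clean; any duplicate or out-of-order `ShardHomeAllocated` blows up exactly as in the stack trace below.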
Three possible causes of this:

1. A bug inside the PersistentShardCoordinator itself - I'm still untangling this; it's possible a race condition could cause it if the actor were trying to Persist its events out of order. Technically, the actor should never even be asked to host a shard until its region gets created first. I doubt this is the issue, but I can't rule it out 100%.
2. A bug in the Cluster.Sharding message serializer.
3. A bug that causes the PersistentShardCoordinator's events not to be saved or recovered in the correct order - that would certainly cause this.

Going to eliminate number 2 first since that's the simplest - will look into the others next.
Manually verified the output of this spec and can vouch for its accuracy - the sharding serializer appears to be working correctly.
This is probably the issue https://github.com/akkadotnet/akka.net/issues/3204
Going to create some reproduction specs and then see where things go.
Working on a fun reproduction of this using an actual integration test against SQL Server, spun up via docker-compose:
https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro
I was able to reproduce this issue on my end with my copy of the above project: https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro.
And received the same failing-to-recover error message:

```
[ERROR][03/18/2019 23:34:59][Thread 0003][[akka://ShardFight/system/sharding/fubersCoordinator/singleton/coordinator#329445421]] Exception in ReceiveRecover when replaying event type [Akka.Cluster.Sharding.PersistentShardCoordinator+ShardHomeAllocated] with sequence number [113] for persistenceId [/system/sharding/fubersCoordinator/singleton/coordinator]
sharding.shard_1 | Cause: System.ArgumentException: Shard 23 is already allocated
sharding.shard_1 | Parameter name: e
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.State.Updated(IDomainEvent e)
sharding.shard_1 |    at Akka.Cluster.Sharding.PersistentShardCoordinator.ReceiveRecover(Object message)
sharding.shard_1 |    at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
```
I've attached the data from the EventJournal database in the hope of finding more information on what is causing this behavior. Journal.zip
@izavala I'll deserialize the data gathered from the repo here and see what's up - that should paint a clearer picture as to what's going on.
Wrote a custom tool using Akka.Persistence.Query using the dataset that created this error: https://github.com/Aaronontheweb/Cluster.Sharding.Viewer
Attached is the output. Haven't analyzed it yet, but this is the same data from @izavala's reproduction.
Worth noting in these logs: no snapshots were ever saved for the PersistentShardCoordinator under this run - it logged 30-34 ShardHomeAllocated messages per run using our reproduction app. The default Akka.Cluster.Sharding settings have us only take a snapshot once every 1000 journaled entries.
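For reference, the snapshot interval is controlled by the `snapshot-after` HOCON setting, shown here with what I believe is its default value - a configuration sketch, not tuning advice:

```hocon
akka.cluster.sharding {
  # The coordinator snapshots its state after this many persisted events.
  # With the default of 1000, our 30-34 ShardHomeAllocated events per run
  # never triggered a snapshot, so every recovery replayed the full journal.
  snapshot-after = 1000
}
```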
So the logs we've produced confirm that #3204 is the issue - the exception in recovery only occurs when it's the same node with the same address trying to deserialize its own RemoteActorRefs each time. The issue doesn't occur when the node reboots using a new hostname, as when we tear down our Docker cluster and recreate it.
Moving this to 1.4.0 - changes are too big to put into a point release. We're going to need to make a lot of changes to the serialization system for IActorRefs to complete this.
Hey cool, I've run into this one too. We're still in a prototype phase, but it was on my mind as an issue to address before moving toward production.
@Aaronontheweb Since you're making serialization system changes, just a heads up that with your netstandard2.0 update in #3668 the difference between Framework and Core disappears. See my commit referencing the issue for the code that removes the difference.
I've been able to verify via https://github.com/Aaronontheweb/AkkaClusterSharding3414Repro/pull/10 that https://github.com/akkadotnet/akka.net/pull/3744 resolves this issue. I'm not done with #3744 yet - still need to make sure this works with serialize-messages and so on, but we're getting there.
This is now resolved as of #3744
Hey guys, since this issue has been fixed I recommend updating the README at https://github.com/petabridge/akkadotnet-cluster-workshop, since at the end of it this issue is still pointed to as an active one.
Akka 1.3.5
This morning, while making some provisioning changes, we ended up in a state where two single-node clusters were running against the same database. After fixing the error and starting only a single node, the underlying Akka code was failing to recover.
My expectation is that, in the case of two nodes attempting to own the same data, one would eventually see a journal write error because the journal sequence number would not be unique, and the ActorSystem would then shut itself down. On recovery it should always be able to get back into a consistent state.
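That expectation can be sketched with a toy journal that enforces a unique (persistenceId, sequenceNr) key, the way a SQL journal table's unique index would. This is an illustrative model, not Akka.Persistence code:

```csharp
using System;
using System.Collections.Generic;

// Toy journal: rejects duplicate (persistenceId, sequenceNr) pairs,
// mimicking the unique index a SQL event-journal table enforces.
class ToyJournal
{
    readonly Dictionary<(string Pid, long SeqNr), string> _events =
        new Dictionary<(string, long), string>();

    public void Write(string persistenceId, long seqNr, string evt)
    {
        var key = (persistenceId, seqNr);
        if (_events.ContainsKey(key))
            // A second writer reusing a sequence number fails here,
            // giving that node the chance to shut itself down rather
            // than corrupt the coordinator's event stream.
            throw new InvalidOperationException(
                $"Duplicate sequence number {seqNr} for {persistenceId}");
        _events[key] = evt;
    }
}

class Program
{
    static void Main()
    {
        var journal = new ToyJournal();
        // Node A writes sequence number 1...
        journal.Write("/system/sharding/coordinator", 1, "ShardRegionRegistered");
        try
        {
            // ...then node B, pointed at the same database, tries seqNr 1 again.
            journal.Write("/system/sharding/coordinator", 1, "ShardRegionRegistered");
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

If the journal accepts both writes instead of rejecting the second, the two nodes interleave conflicting events under one persistenceId, and recovery replays an inconsistent stream - which is the failure mode described here.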
In our case this was caused by user error, but it could easily occur during a network partition where two nodes claim to own the same underlying dataset.