akkadotnet / akka.net

Canonical actor model implementation for .NET with local + distributed actors in C# and F#.
http://getakka.net

Akka.Cluster.Sharding v1.5: Remember entity store did not respond, restarting shard #5860

Closed by Aaronontheweb 1 year ago

Aaronontheweb commented 2 years ago

Version Information

Version of Akka.NET? v1.5.0
Which Akka.NET Modules? Akka.Cluster.Sharding, state-store-mode=ddata, remember-entities=on

Describe the bug

Running a brand new cluster with no prior history stored for remember-entities, but still seeing a large number of these:

Remember entity store did not respond, restarting shard

Possibly due to:

Timestamp
2022-04-20T20:37:24.3004664Z
System.InvalidOperationException: Async write timed out after 00:00:05
   at Akka.Cluster.Sharding.Shard.<>c__DisplayClass73_0.<WaitingForRememberEntitiesStore>g__WaitingForRememberEntitiesStore|0(Object message)
   at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message)
   at Akka.Actor.ActorCell.ReceiveMessage(Object message)
   at Phobos.Actor.PhobosActorCell.ReceiveMessage(Object message)
   at Akka.Actor.ActorCell.Invoke(Envelope envelope)

To Reproduce

Deploy


Expected behavior

Remember-entities (RE) should "no-op" and start up without any issues if no remembered entities are stored.

Actual behavior

Crashed.

Aaronontheweb commented 2 years ago

cc @zbynek001

zbynek001 commented 2 years ago

I'm using persistent mode, so I haven't tested the ddata scenario much. I'll take a look and see if I can find something, but I might need some more info. Do you have any custom config overrides? How big is the cluster?

Aaronontheweb commented 2 years ago

Cluster size was 130 nodes, but I was also able to reproduce at 20 nodes.

HOCON configuration didn't have anything custom really - using Akka.Persistence.Azure but otherwise using Akka.Cluster.Sharding defaults with state-store-mode=ddata and remember-entities=on.
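For reference, the sharding settings described above would look roughly like this in HOCON. This is a minimal sketch and not the actual configuration from the affected cluster. The first two settings are the ones named in this issue. The `updating-state-timeout` line is an assumption: it is shown at its default value because it appears to match the `00:00:05` in the "Async write timed out" exception, but nothing in this thread confirms that it is the setting involved.

```hocon
akka.cluster.sharding {
  # Settings named in this issue
  state-store-mode = ddata
  remember-entities = on

  # Assumption: default value, believed to be the timeout reported as
  # "Async write timed out after 00:00:05". Not confirmed in this thread.
  updating-state-timeout = 5s
}
```

Everything else was left at the Akka.Cluster.Sharding defaults, with Akka.Persistence.Azure configured separately.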

It looked to me like the system was still trying to retrieve some of its remember-entities data through a combination of Akka.Persistence and DData. I'll re-run the sample today and capture a log dump from Seq. I might even OSS the sample since it's just something I'm using to stress test Akka.Cluster for a Petabridge customer.

Aaronontheweb commented 1 year ago

This appears to have been a transitory issue with the busy-ness of that system, not a bug with the software.