dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Client stream subscribers are not being removed #2505

Closed mhertis closed 7 years ago

mhertis commented 7 years ago

We are using Orleans 1.2.3 and SimpleMessageStreamProvider with FireAndForgetDelivery=true and AzureTableStorage for PubSubStore.

Our stream has one producer (a grain) and one consumer (a client). The consumer code is quite unstable: it crashes fairly often and then subscribes to the stream again.

After a few days we observe the following exception:

ERROR 102203 Storage.AzureTableStorage.1 100.104.2.49:11111] !!!!!!!!!! Error from storage provider during WriteState for grain Type=Orleans.Streams.PubSubRendezvousGrain Pk=grn/716E8E94/0000000000000000000000000000000006000000716e8e94+sms_GWD1:cd4bb0089f054000-downstream-0x432DE56F Id=GrainReference:grn/716E8E94/00000000+sms_GWD1:cd4bb0089f054000-downstream Error=

Exc level 0: System.ArgumentOutOfRangeException: Data too large to write to Azure table. Size=983105 MaxSize=983040
Parameter name: GrainState.Size
   at Orleans.Storage.AzureTableStorage.CheckMaxDataSize(Int32 dataSize, Int32 maxDataSize)
   at Orleans.Storage.AzureTableStorage.ConvertToStorageFormat(Object grainState, DynamicTableEntity entity)
   at Orleans.Storage.AzureTableStorage.d32.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Core.GrainStateStorageBridge.d6.MoveNext()
Exc level 0: System.ArgumentOutOfRangeException: Data too large to write to Azure table. Size=983105 MaxSize=983040
Parameter name: GrainState.Size
   at Orleans.Storage.AzureTableStorage.CheckMaxDataSize(Int32 dataSize, Int32 maxDataSize)
   at Orleans.Storage.AzureTableStorage.ConvertToStorageFormat(Object grainState, DynamicTableEntity entity)
   at Orleans.Storage.AzureTableStorage.d32.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Orleans.Core.GrainStateStorageBridge.d6.MoveNext()

When a new item is produced on the stream, we see "Consumer is no longer active - permanently removing Consumer" in the logs, but the record remains in table storage. <-- Need to verify this claim; it is possible that no item is produced on the stream while the client consumer keeps subscribing again and again.

jason-bragg commented 7 years ago

The size of the subscription data is limited by the underlying storage, and it looks like you're hitting the limit for Azure Table storage. Given your scenario, though (1 producer, 1 consumer), this should not be happening.

You mentioned that consumers sometimes have issues and need to subscribe again. In these cases, do you resume the existing subscription or create a new one?

When configured for explicit subscriptions, every call to SubscribeAsync creates a persistent subscription that exists even if the consuming grain goes away. So if a grain calls subscribe on every activation, it will create many subscriptions, which is probably not your intention.

If one is using explicit subscriptions, it is probably best to check for existing subscriptions prior to creating a new one. This can be done by calling IAsyncStream.GetAllSubscriptionHandles(). If a subscription handle is found, one can resume consuming from that subscription by calling StreamSubscriptionHandle.ResumeAsync, or remove the subscription by unsubscribing (StreamSubscriptionHandle.UnsubscribeAsync).
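A minimal sketch of the check described above, assuming the Orleans streaming types IAsyncStream<T>, IAsyncObserver<T> and StreamSubscriptionHandle<T>. The helper name ResumeOrSubscribeAsync and the policy of keeping the first handle and unsubscribing the rest are illustrative assumptions, not part of Orleans. Whether handles created by an earlier client process are returned to a restarted client is not established in this thread and should be verified.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans.Streams;

public static class StreamSubscriptionHelper
{
    // Hypothetical helper (not an Orleans API): reuse an existing explicit
    // subscription when one is found, otherwise create a new one.
    public static async Task<StreamSubscriptionHandle<T>> ResumeOrSubscribeAsync<T>(
        IAsyncStream<T> stream,
        IAsyncObserver<T> observer)
    {
        // Ask the stream for the subscriptions already registered for this consumer.
        IList<StreamSubscriptionHandle<T>> handles = await stream.GetAllSubscriptionHandles();

        if (handles.Count == 0)
        {
            // Nothing to resume: this is the only path that adds a pubsub record.
            return await stream.SubscribeAsync(observer);
        }

        // Assumed policy: resume the first handle and drop any extras so that
        // stale subscriptions do not pile up in the pubsub store.
        StreamSubscriptionHandle<T> active = await handles[0].ResumeAsync(observer);
        for (int i = 1; i < handles.Count; i++)
        {
            await handles[i].UnsubscribeAsync();
        }

        return active;
    }
}
```

Calling this helper wherever the consumer currently calls SubscribeAsync directly would make repeated subscription attempts idempotent, under the assumptions stated above.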

mhertis commented 7 years ago

As you said, this should not be happening. =D

We are not resuming existing subscriptions on our client. That seems to be the root of our problem. We would need to address this.

Thinking forward, can one get into this error if no item is pushed to the stream and the client is reconnecting to the cluster and creating new subscriptions? Are dead consumers cleaned up from the pubsub store if no item is pushed to the stream?

jason-bragg commented 7 years ago

"can one get into this error if no item is pushed to the stream and client is reconnecting to the cluster and creating new subscriptions."

Whether a producer generates events or not should not affect this. Subscription management should be mostly orthogonal to producer behavior.

There are, however, some peculiarities related to streaming from the client. If a client loses connection to the cluster, all of the existing producers and consumers on that client will be removed from the pubsub system, even if the client does not unsubscribe. For this reason I would suggest using no fewer than 3 gateways when using streaming on the client. This is not strictly related to this issue, but it does concern subscription management from an Orleans client.

mhertis commented 7 years ago

OK, thanks for the clarification. It seems that I was a little too quick in posting this issue.

I was relying on the behavior you pointed out, that all producers/consumers are removed from pubsub when the client disconnects. The caveat is that in our case the client did not disconnect (at first I assumed it did) before resubscribing to the stream. Idempotent subscriptions will help in such a case.

mhertis commented 7 years ago

I'm also observing the exceptions below. Are these somehow related?

System.ArgumentNullException: Value cannot be null.
Parameter name: source

Server stack trace: 
   at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source, Func`2 predicate)
   at Orleans.Streams.PubSubRendezvousGrain.<RegisterConsumer>d__16.MoveNext()

jason-bragg commented 7 years ago

@mhertis What storage provider are you using?

mhertis commented 7 years ago

AzureTableStorage

mhertis commented 7 years ago

"If a client loses connection to the cluster, all of the existing producers and consumers on that client will be removed from the pubsub system, even if the client does not unsubscribe."

For this reason we need to know the client connection status: some kind of signal/event on connection change (e.g. lost, reestablished, ...) so that the client can resubscribe to the stream.

There was also a short conversation on this topic on Gitter.

sergeybykov commented 7 years ago

@mhertis, please open a separate issue for client disconnect notification.

mhertis commented 7 years ago

@sergeybykov #2551 =D

sergeybykov commented 7 years ago

Thank you!

sergeybykov commented 7 years ago

Can we close this one now?

mhertis commented 7 years ago

Yes we can, thank you.