dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans

PubSubRendezvous Grain multiple activations during Production\Staging style deployment. Streams rendezvous design problem? #2418

Closed centur closed 2 years ago

centur commented 7 years ago

I raised this question in the gitter chat and it seems to be valid, so I'm opening this issue to start a discussion. The case is:

When anyone is doing a "Cloud Service-style deployment" (a full deployment of the new version to an environment called Staging, followed by a VIP swap that transparently replaces the Production deployment with Staging), there is a high chance of multiple activations of the PubSubRendezvous grain for the ExplicitGrainBasedAndImplicit stream pub-sub type.

The data for this grain is not DeploymentId-dependent, so when the new cluster starts in the staging slot it will write or clear the State in OnActivateAsync. In the current code this is conditional, but given that the staging activation has just read the production activation's data about producers and consumers, I suspect it simply removes all of them, because the existing production producers and consumers are not reachable across the two deployments (we make silo ports available only to other machines in the same cluster, and Cloud Services go even further and isolate them completely afaik).

As a result of this write/clear, the production activation now holds an incorrect etag, fails with an exception on its next write/clear, and re-reads the state that was written by the staging activation. The cycle repeats: staging fails, crashes, re-reads, writes, breaks the etag of the production activation, and so on. The valid etag "ping-pongs" between the production and staging activations.

A few questions:

  1. Is this a valid scenario of what happens in the code? As far as I understand, and based on what we observed in our prod environment, two Orleans clusters were doing something similar to what is described above (it was on v1.2.3; I'm going to test it on 1.3.1, but the latest code hints that the behaviour will be the same) - killing and restoring silos on prod and staging. We managed to break the ping-pong cycle of death by explicitly shutting down staging and restarting the hot production environment.

  2. Can it technically be solved? 2.1. Without introducing breaking changes?

  3. If the answer to 2 is yes, I'm curious how it can be solved, as I can't come up with any solution that fixes it while preserving the streams behaviour:

From my understanding, we either need to use the deployment ID in the PubSub grain key - which makes stream subscriptions scoped to the given deployment and breaks one of the key concepts of streams, namely that once subscribed, a listener will always receive events (no matter how many deployments happen between subscription time and event time).

Or we need to prevent the staging activation from clearing dead producers/consumers (and writing the state), which breaks the exact purpose of this grain.

centur commented 7 years ago

A similar problem may appear in the Reminders domain (not 100% sure if the reminder state is written on each tick): if a reminder causes state changes, whichever deployment 'ticks' first - Staging or Prod - will make the changes to the 'tick-recipient' grain, and that grain's etag will be in a ping-pong loop between the two deployments too. Although this is not as dramatic, since it's effectively a "grain call double delivery" problem and can be partially solved on the recipient side by making the behaviour idempotent and doing ReadStateAsync() before handling a tick.
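A minimal sketch of that recipient-side mitigation (TickRecipientGrain, TickState and the tick-key scheme are made-up names for illustration, not anything from Orleans itself):

using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

// Hypothetical persistent state: remembers the last tick that was applied.
public class TickState
{
    public string LastHandledTick { get; set; }
}

// Grain interface omitted for brevity.
public class TickRecipientGrain : Grain<TickState>, IRemindable
{
    public async Task ReceiveReminder(string reminderName, TickStatus status)
    {
        // Re-read state first, so we see any write made by the other deployment's activation.
        await ReadStateAsync();

        // Both deployments compute the same key for the same tick, making the handler idempotent.
        var tickKey = $"{reminderName}:{status.CurrentTickTime:O}";
        if (State.LastHandledTick == tickKey)
            return; // the other deployment already handled this tick

        // ... apply the actual state changes for this tick here ...

        State.LastHandledTick = tickKey;
        // The etag race can still be lost here; after a re-read the same tick is recognized and skipped.
        await WriteStateAsync();
    }
}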

sergeybykov commented 7 years ago

This is sort of by design today - two clusters are unaware of each other, and only "synchronize" via storage. The ongoing in-place upgrade work is meant to improve the experience.

For persistent stream providers, there's a way to start them in the staging cluster in a no-delivery mode, so that the pulling agents won't deliver events and will only fill their caches. There is a way to send commands to stream providers to start/stop delivering events, for a smooth hand-off of event delivery from production to staging. @jason-bragg can provide more details.

From my understanding, we either need to use the deployment ID in the PubSub grain key - which makes stream subscriptions scoped to the given deployment and breaks one of the key concepts of streams

That would be against the model of virtual streams.

Similar problem may appear in the Reminders domain

Correct. For the same reason - reminders are service scoped, not deployment.

jason-bragg commented 7 years ago

@centur What stream provider are you using?

Is this a valid scenario of what happens in the code?

Yes. Your description of the problem, per my understanding, is correct. For persistent stream providers, which I assume you are using, the producers are cluster specific so the pubsub states between the two clusters may conflict as long as both are running.

Can it technically be solved? Without introducing breaking changes?

Minimizing the time that the two clusters are running simultaneously is the most common mitigation for this. I understand that this is a mitigation, not a solution. :/ As Sergey stated, there are ways to stop or start a persistent stream provider in a cluster, so depending on your needs, the staging cluster can start with streaming turned off to avoid the conflict. Once the clusters are swapped and the original cluster is shut down (or its streaming is stopped), streaming in the new cluster can be started.

jason-bragg commented 7 years ago

@centur Is your stream topology suitable (or can it be made suitable) for implicitly subscribed streams?

centur commented 7 years ago

@jason-bragg :) Well, yes, this is a mitigation, just as doing upgrades with a short downtime (10-20 minutes) would be, but it's not a solution.

With regards to the other questions - we are mostly using implicit subscriptions, which work well - they have no footprint in storage at all :) But in this particular case it has to be mixed - our websocket clients start to listen on the stream for incoming data as they open a particular page/document and unsubscribe as they close the doc.
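Roughly, that per-document lifecycle looks like the sketch below (the "AzureQueueProvider" name, the "doc-events" namespace and the DocEvent/DocStreamBridge types are placeholders, not our actual code):

using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Placeholder event type carried on the per-document stream.
public class DocEvent
{
    public string Payload { get; set; }
}

// Lives in the websocket host, which runs as an Orleans client (GrainClient.Initialize has been called).
public class DocStreamBridge
{
    private StreamSubscriptionHandle<DocEvent> _handle;

    // Called when a websocket client opens a particular page/document.
    public async Task OnDocOpened(Guid docId, IAsyncObserver<DocEvent> forwardToWebsocket)
    {
        var provider = GrainClient.GetStreamProvider("AzureQueueProvider");
        var stream = provider.GetStream<DocEvent>(docId, "doc-events");
        // This explicit subscription is what ends up in the PubSubRendezvous state.
        _handle = await stream.SubscribeAsync(forwardToWebsocket);
    }

    // Called when the websocket client closes the document (or disconnects).
    public async Task OnDocClosed()
    {
        if (_handle != null)
        {
            await _handle.UnsubscribeAsync();
            _handle = null;
        }
    }
}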

I'm not yet sure how this design can be amended to use only implicit streams - we may need to introduce another bus, like Service Bus, and implicitly re-stream there, with the websocket clients listening to Service Bus instead of the Orleans stream, but that defeats the purpose of the stream, and we are too late in the project to quickly introduce such a change.

Yes, we are using the Azure Queue stream provider for persistent streams; we're thinking about switching to something else, but there are no clear plans for that yet.

I'm keen to learn how we can control the pulling agents in a cluster and what would be the best way to configure such a start/stop switch. I believe it must be done during silo initialization, not after that, right? Any hints and examples are welcome :)

jason-bragg commented 7 years ago

TL;DR: Any persistent stream provider can be initialized in a 'stopped' state by configuring it with the setting StartupState="AgentsStopped" (in code, enum PersistentStreamProviderState.AgentsStopped). One can start or stop a stream provider at runtime by making a call to the management grain. Example:

        var mgmt = GrainClient.GrainFactory.GetGrain<IManagementGrain>(0);
        await mgmt.SendControlCommandToProvider(providerType, providerName, (int)PersistentStreamProviderCommand.StartAgents);
        // or
        await mgmt.SendControlCommandToProvider(providerType, providerName, (int)PersistentStreamProviderCommand.StopAgents);

These calls will fan out to all active silos in the cluster and start/stop the provider of the specified type/name. Examples of this can be found in PullingAgentManagementTests.
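For example, the hand-off around a VIP swap might look roughly like this (the ordering and the final status query are only a sketch, not a documented procedure; each command must be issued by a client connected to the respective cluster, and providerType/providerName are the provider's full type name and configured name):

// 1. In the old (swapped-out) cluster: stop delivery so its pulling agents go quiet.
var mgmt = GrainClient.GrainFactory.GetGrain<IManagementGrain>(0);
await mgmt.SendControlCommandToProvider(providerType, providerName, (int)PersistentStreamProviderCommand.StopAgents);

// 2. In the new cluster (started with StartupState="AgentsStopped"): begin delivery.
await mgmt.SendControlCommandToProvider(providerType, providerName, (int)PersistentStreamProviderCommand.StartAgents);

// 3. Optionally query the new cluster to confirm agents are running (GetAgentsState / GetNumberRunningAgents).
await mgmt.SendControlCommandToProvider(providerType, providerName, (int)PersistentStreamProviderCommand.GetNumberRunningAgents);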

Long version: A while back we had an internal customer that requested the ability to send control signals to providers as well as stop and start stream providers. To meet this need we allowed any provider to be ‘controllable’ by inheriting from IControllable. Any provider inheriting from this interface can be sent simple control signals consisting of an integer and an object. This framework was mostly used in testing to inject failures into providers, but was also used to allow persistent stream providers to be started, stopped and queried at runtime, by sending them the following signals.

public enum PersistentStreamProviderCommand
{
    None,
    StartAgents,
    StopAgents,
    GetAgentsState,
    GetNumberRunningAgents,
}

Since persistent stream providers use adapters that can be developed by customers, the persistent stream provider also allows commands to be directed to controllable adapter factories or adapters, via commands in the following ranges.

public enum PersistentStreamProviderCommand
{
    AdapterCommandStartRange = 10000,
    AdapterCommandEndRange = AdapterCommandStartRange + 9999,
    AdapterFactoryCommandStartRange = AdapterCommandEndRange + 1,
    AdapterFactoryCommandEndRange = AdapterFactoryCommandStartRange + 9999,
}

The Azure Queue adapter and adapter factory do not support any specific commands, though one can develop custom decorators over them to respond to commands. For examples of controllable adapters, please see the ControllableStreamGeneratorProviderTests tests.
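As a sketch of what such a controllable decorator might look like (this assumes the IControllable contract is a single ExecuteCommand(int command, object arg) method returning Task<object>; FlushCaches is a made-up command, and a real factory would also implement the adapter factory interface):

using System.Threading.Tasks;
using Orleans.Providers;                // IControllable
using Orleans.Providers.Streams.Common; // PersistentStreamProviderCommand (namespace may differ by version)

public class ControllableAdapterFactoryDecorator : IControllable
{
    // Hypothetical custom command, placed inside the adapter-factory command range.
    private const int FlushCaches = (int)PersistentStreamProviderCommand.AdapterFactoryCommandStartRange + 1;

    public Task<object> ExecuteCommand(int command, object arg)
    {
        if (command == FlushCaches)
        {
            // ... drop whatever adapter-level caches/state the wrapped factory keeps (hypothetical) ...
            return Task.FromResult<object>(true);
        }

        // Unrecognized commands are simply ignored in this sketch.
        return Task.FromResult<object>(null);
    }
}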

While these capabilities were developed and tested, and are used in some test scenarios, they, to my knowledge, have not been used in production environments. This is also an alarmingly leaky abstraction, hence we've not advertised this capability...

jason-bragg commented 7 years ago

wait...

our websocket clients start to listen on the stream for incoming data as they open a particular page/document and unsubscribe as they close the doc.

So this means that the subscriptions need only exist as long as the websocket connection, correct? So they are cluster specific?

If the pub-sub information is in fact deployment specific, because it is tied to websocket connections which are deployment specific, we have more options.

One should be able to use an in-memory storage provider for the pubsub system, which will be cluster specific (and faster). The in-memory storage provider does have some vulnerabilities when it comes to failure, but those vulnerabilities have been significantly reduced in the 1.3.x versions.

If one wants the stronger reliability of azure storage, each deployment can be configured with a different table name, by deriving the table name from the deployment Id.
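For illustration, in the classic XML silo configuration the two options might look roughly like this ("PubSubStore" is the storage provider name the streaming pub-sub uses; attribute values are placeholders and may differ by version):

<OrleansConfiguration xmlns="urn:orleans">
  <Globals>
    <StorageProviders>
      <!-- Option A: deployment-scoped (and faster): keep pub-sub state in memory only. -->
      <Provider Type="Orleans.Storage.MemoryStorage" Name="PubSubStore" />

      <!-- Option B: durable but isolated per deployment: derive the table name from the deployment id.
      <Provider Type="Orleans.Storage.AzureTableStorage" Name="PubSubStore"
                DataConnectionString="..." TableName="PubSubStoreYourDeploymentId" />
      -->
    </StorageProviders>
  </Globals>
</OrleansConfiguration>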

centur commented 7 years ago

We are using Azure Queues to solve two kinds of problems: decoupling the send and delivery moments - so an intensive sender can't create an immediate spike for all consumers, and consumers can process at their own pace - and a deadlock problem: the simple message stream causes deadlocks, especially since we started to use more complex triggering and re-sending of messages.

But the advice about the in-memory storage provider may actually work well for us - our current scenarios are very heavily deployment-scoped, so there is no point in using a storage-backed PubSubRendezvous grain.

Btw, is there any way to have two different implementations of the PubSubRendezvous grain in the cluster? E.g. if we end up needing both modes - persistent and deployment-scoped streams - can we just create a custom PubSubRendezvous, inherit from the Orleans one and override the storage provider there, or is this scenario currently not possible?

jason-bragg commented 7 years ago

can we just create a custom PubSubRendezvous

The stream pubsub system is still an internal detail of the streaming infrastructure. It has multiple implementations internally, but has not been opened up to the public for customers to provide their own implementations. This is something I would like to see, but have not had the time nor demand for such a feature. While I don't suspect you will need a custom pubsub system, if you or your team determines that you do have such a need, I'd be quite happy to discuss some approaches. A user implemented pubsub system opens a great many possibilities.

While you asked about different PubSubRendezvous grains for per deployment or cross deployment needs, this is related more to the storage provider than the pubsub grain. At the moment, the pubsub system storage is global. Unfortunately, there is no way to configure it per stream provider.

jason-bragg commented 7 years ago

Simple message stream causes deadlocks

If the needs are for an ephemeral stream, but SMS fan-out is causing issues, you may consider tinkering with the MemoryStreamProvider. It's new and was primarily developed for development of persistent streams, so developers would not have to manage their own persistent queue, just to test their code. It is a persistent stream provider that keeps the messages in in-memory queue grains, decoupling send and delivery. These queue grains are not persisted, so messages will be lost during silo outages, but if data loss is not a major concern, the provider may be worth considering.
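Since the stream API itself is provider-agnostic, moving from the Azure Queue provider to something like the MemoryStreamProvider should only be a configuration/registration change; producer code along these lines (names are placeholders) stays the same:

using System.Threading.Tasks;
using Orleans;

public interface IDocPublisherGrain : IGrainWithGuidKey
{
    Task Publish(string payload);
}

// Which provider backs the "doc-events" stream is decided purely by configuration;
// "AzureQueueProvider" here could be swapped for a MemoryStreamProvider registration.
public class DocPublisherGrain : Grain, IDocPublisherGrain
{
    public async Task Publish(string payload)
    {
        var provider = GetStreamProvider("AzureQueueProvider");
        var stream = provider.GetStream<string>(this.GetPrimaryKey(), "doc-events");
        await stream.OnNextAsync(payload);
    }
}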

centur commented 7 years ago

@jason-bragg Thanks for the hint about PubSubRendezvous - we really don't need to keep its state, and it can be in-memory only - if the grain dies, websocket clients will disconnect and must reconnect again, which is expected and desired behaviour. It works more safely now.

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.