Disaster Recovery mode - GitHub Issues

arnoldsi-payo commented 7 months ago

Hi, I want to present a scenario and understand how Elsa handles this scenario.

I have an instance in Region A (currently pointed by DNS) and one in Region B. Both instances are in Active-Active mode. I'm using Mongo as persistence for Elsa, and the data is continuously replicated from Region A to Region B.

Is it possible to tell Elsa on the instance in Region B not to pull workflows and to stay in standby mode, and then to activate it and send all traffic to Region B in case of DR?

Thank you

glime-ai[bot] commented 7 months ago

Based on the information provided in the context, Elsa Workflows supports distributed hosting, which means it can run in an environment with multiple nodes. To handle scenarios like the one you described, where you have instances in different regions and want one to be in standby mode, you would typically use a combination of load balancers, DNS configurations, and possibly the configuration of Elsa itself to manage which instance is active and which is on standby.

Elsa does not have a built-in feature to set an instance to "standby" mode directly. However, you can achieve a similar result by controlling the traffic directed to each instance. For example, you could configure your DNS or load balancer to route all traffic to the active instance in Region A and only switch to Region B in case of a disaster recovery (DR) situation.

In the event of a DR situation, you would change your DNS or load balancer configuration to direct traffic to the instance in Region B. Since you are using MongoDB with data replication between regions, the instance in Region B should have access to the latest workflow data and be able to take over.

Additionally, you can configure Elsa to use a service bus for distributing workflow execution commands across instances. This way, when an instance becomes active, it can start processing the commands from the service bus. You can find more information on distributed hosting and service bus configuration in the Elsa documentation.

For a more detailed and specific solution, you would need to consult the Elsa documentation, community forums, or the Elsa GitHub repository for guidance on setting up such a scenario.

arnoldsi-payo commented 7 months ago

Nice bot response :) The instances are always active, even though there is no traffic to Region B. Because the DB is always replicated, I'm afraid the Region B instance will pick up a workflow and start processing it even if the region is not "active".

sfmskywalker commented 7 months ago

Hey there, the short answer is that there's no support for this scenario. But that doesn't have to be the end of it, either. Although Elsa will not handle routing traffic, we can try and think of a way to have the ability to enable/disable the workflow server. The challenge here might be that there's no "workflow engine" root type that does everything. Instead, the entire ASP.NET Core application is the engine, which might be connected to message queues, listening for HTTP Requests, and custom code from users that trigger workflow execution (e.g. from custom controllers).

Of course, whichever the stimulus may be, ultimately all routes lead into a workflow runtime layer, which could decide to essentially return early in case the application is configured to be in standby mode.
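To make the "return early" idea concrete, here is a minimal, language-agnostic sketch in Python rather than C#. Nothing in it is Elsa's API: `ActivationGate`, `GatedRuntime` and `RunResult` are hypothetical names, and the runtime is reduced to a single `run` method that every stimulus is assumed to call.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Optional


class ActivationGate:
    """Thread-safe flag saying whether this node may run workflows (hypothetical)."""

    def __init__(self, active: bool = False) -> None:
        self._event = threading.Event()
        if active:
            self._event.set()

    def activate(self) -> None:
        self._event.set()

    def deactivate(self) -> None:
        self._event.clear()

    @property
    def is_active(self) -> bool:
        return self._event.is_set()


@dataclass
class RunResult:
    executed: bool
    output: Optional[Any] = None


class GatedRuntime:
    """Stand-in for the runtime layer that all stimuli are assumed to pass through."""

    def __init__(self, gate: ActivationGate) -> None:
        self._gate = gate

    def run(self, workflow: Callable[[], Any]) -> RunResult:
        # Standby: return early without executing anything.
        if not self._gate.is_active:
            return RunResult(executed=False)
        return RunResult(executed=True, output=workflow())
```

In this sketch the Region B node would start with `ActivationGate(active=False)` and call `activate()` during failover. How the flag gets flipped (configuration, an admin endpoint, a health probe) is left open here.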

A potential issue with this approach, however, is that messages on a message queue are still ACKed and therefore removed from the queue, instead of sitting there until the workflow engine does become active. To handle this, the "is active" checking logic would have to sit in the message consumer itself and not acknowledge the message. Better yet, the consumer itself should simply stop consuming.
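The "stop consuming" variant could look roughly like the sketch below. It is an illustration only: Python's in-process `queue.Queue` stands in for a real broker, `task_done()` plays the role of the ACK, and `GatedConsumer` is a hypothetical name, not an Elsa or RabbitMQ type.

```python
import queue
import threading
from typing import Callable


class GatedConsumer:
    """Consumer that does not fetch messages while the node is in standby."""

    def __init__(
        self,
        broker: "queue.Queue[str]",
        active: threading.Event,
        handler: Callable[[str], None],
    ) -> None:
        self._broker = broker
        self._active = active
        self._handler = handler

    def poll_once(self, timeout: float = 0.1) -> bool:
        """Handle at most one message; return True if one was handled."""
        if not self._active.is_set():
            # Standby: nothing is fetched, so nothing is acknowledged or lost.
            return False
        try:
            message = self._broker.get(timeout=timeout)
        except queue.Empty:
            return False
        self._handler(message)
        self._broker.task_done()  # stands in for the ACK, sent only after handling
        return True
```

Because the check happens before the fetch, messages stay on the queue until the node becomes active, which is the behaviour described above. With a real broker the equivalent step would be cancelling or pausing the consumer subscription.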

Similarly, given the various other "sensors" such as HTTP requests, Timer events, job processors and user-specific code, this "is active" logic might have to be implemented in all of those, and repeated for any new sensors added to the system.
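One way to avoid repeating the check in every sensor might be to have each sensor expose start/stop and let a single coordinator toggle them. The sketch below only illustrates that shape. `Sensor`, `StandbyCoordinator` and `RecordingSensor` are made-up names, and whether Elsa's actual sensors can be started and stopped this way is an assumption.

```python
from typing import List, Protocol


class Sensor(Protocol):
    """Anything that can trigger workflows: queue consumer, timer, HTTP listener."""

    def start(self) -> None: ...
    def stop(self) -> None: ...


class RecordingSensor:
    """Toy sensor that only records whether it is running."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


class StandbyCoordinator:
    """Single place that switches all registered sensors on or off."""

    def __init__(self) -> None:
        self._sensors: List[Sensor] = []
        self._active = False

    def register(self, sensor: Sensor) -> None:
        self._sensors.append(sensor)
        # A sensor added while the node is active starts immediately.
        if self._active:
            sensor.start()

    def set_active(self, active: bool) -> None:
        if active == self._active:
            return
        self._active = active
        for sensor in self._sensors:
            if active:
                sensor.start()
            else:
                sensor.stop()
```

New sensors then only need to implement `start` and `stop` and register themselves. User code that triggers workflows directly (e.g. custom controllers) would still bypass this and need its own check.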

Perhaps there's an elegant design to handle this, but I haven't found it yet and any input is appreciated!

arnoldsi-payo commented 6 months ago

@sfmskywalker, thank you for your response. I would like to know if the library has any architectural design. For example, what is the flow when a new workflow is created? Who puts it into RabbitMQ? Is it a background job or a manual operation?

elsa-workflows / elsa-core

Disaster Recovery mode #4905