Proposed design to address Polling Sensor HA

vivekdatar commented 4 years ago

1 Problem Description

See https://github.com/StackStorm/st2/issues/4301# for problem description. Refer to https://docs.stackstorm.com/reference/ha.html for StackStorm HA design and block diagram.

Summary is as follows.

StackStorm polling sensors are not HA aware. A polling sensor polls without any knowledge of HA, so if another polling sensor is doing the same job in HA mode, both continue to poll independently. This is an issue when running HA in active-active mode. If customers split the polling sensors in HA mode, the duplication is avoided and the system runs fine, at the cost of a single point of failure for each polling sensor.

In active-active HA mode, duplicate polling sensors can create duplicate events, which can cause issues. The goal of this design is to make sure that in active-active HA mode only one polling sensor polls at a given time; another polling sensor will start polling if the first one fails for some reason. Note that there can be multiple polling sensors, out of which only one should poll at any given time.

2 Design Considerations/Assumptions

The design assumes that st2sensorcontainer and st2actionrunner are on the same node (i.e. VM) in the HA configuration.

A central broker entity is needed to coordinate multiple instances of polling sensors running on different blueprint boxes. This broker should allow sensors to register themselves with it. In that sense the broker is similar to ZooKeeper.

3 Block Diagram

[Image: block diagram (Screen Shot 2020-05-14 at 9 54 38 AM)]

3.1 Sensor Arbitrator

Sensor Arbitrator (st2sa) is the “brain” that controls all the HA-enabled polling sensors. It provides the following functionality:

Allows HA-enabled polling sensors to register themselves
Manages heartbeat messages from all HA-enabled sensors
Chooses the “operational” sensor from the list of all registered sensors that perform the same polling function
Informs all sensors of their current status (“operational”, meaning they should poll, or “standby”)
Chooses a different operational sensor if the current operational sensor fails. Three consecutive missed heartbeats imply failure.
Performs adequate logging of all its operations for debugging purposes

This is new code development.
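The responsibilities listed above can be sketched as a small in-memory class. This is only an illustration under stated assumptions: the class name `SensorArbitrator`, its method names, the 10-second heartbeat interval, and the "first live sensor wins" election rule are all hypothetical and not part of the proposal. Only the three-missed-heartbeats rule and the “operational”/“standby” statuses come from the text.

```python
import logging
import time
from collections import defaultdict

LOG = logging.getLogger("st2sa")
MAX_MISSED_HEARTBEATS = 3  # from the proposal: three consecutive misses imply failure


class SensorArbitrator:
    """Hypothetical in-memory sketch of st2sa (not StackStorm code)."""

    def __init__(self, heartbeat_interval=10.0, clock=time.monotonic):
        self.heartbeat_interval = heartbeat_interval  # assumed value, in seconds
        self.clock = clock
        self.last_seen = {}              # sensor id -> time of last message
        self.groups = defaultdict(list)  # polling function -> registered sensor ids
        self.sensor_group = {}           # sensor id -> polling function
        self.operational = {}            # polling function -> elected sensor id

    def register(self, sensor_id, group):
        """Register a sensor for a polling function and return its status."""
        if sensor_id not in self.groups[group]:
            self.groups[group].append(sensor_id)
        self.sensor_group[sensor_id] = group
        self.last_seen[sensor_id] = self.clock()
        self.sweep()
        return self.status_of(sensor_id)

    def heartbeat(self, sensor_id):
        """Record a keep-alive; the response carries the sensor's status."""
        if sensor_id not in self.sensor_group:
            raise KeyError(sensor_id)  # unknown sensor: it must register first
        self.last_seen[sensor_id] = self.clock()
        self.sweep()
        return self.status_of(sensor_id)

    def status_of(self, sensor_id):
        group = self.sensor_group[sensor_id]
        return "operational" if self.operational.get(group) == sensor_id else "standby"

    def is_alive(self, sensor_id):
        deadline = MAX_MISSED_HEARTBEATS * self.heartbeat_interval
        return self.clock() - self.last_seen[sensor_id] < deadline

    def sweep(self):
        """Re-elect an operational sensor wherever the current one has failed."""
        for group, members in self.groups.items():
            current = self.operational.get(group)
            if current is not None and self.is_alive(current):
                continue
            alive = [s for s in members if self.is_alive(s)]
            self.operational[group] = alive[0] if alive else None
            LOG.info("group=%s operational=%s (was %s)",
                     group, self.operational[group], current)
```

A real st2sa would also need persistence, a transport for the messages, and its own failover story; none of that is shown here.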

3.2 HA Enabled Polling Sensors

The current code for polling sensors will be modified for HA operation as follows. Note that the code will be written such that operation in non-HA mode remains exactly the same.

The polling sensor will first register with st2sa
Upon successful registration it will start sending periodic heartbeat messages
The polling sensor will be informed of its status by st2sa. This information can come as a separate message or as part of a keep-alive response. Each keep-alive response will contain the status information
Once the sensor is instructed to poll (status = “operational”) it will start polling. There is no change in polling functionality
Upon restart the polling sensor will go through the registration process once again, meaning it will not start polling until registration is complete and it receives “status=operational” from st2sa

This is a modification of the existing polling sensor code. The modification should be modular, with minimal to zero impact on existing code. We cannot introduce regressions into the existing polling functionality in non-HA mode.

For example, in Python code, HA should be implemented as a separate set of functionality that is invoked only when HA is enabled, and existing code should be changed minimally.
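One way to keep the HA logic separate, sketched under assumptions: `ExistingPollingSensor` is a stand-in for an unmodified sensor, and `HAGate` and `run_poll_cycle` are hypothetical names, not StackStorm APIs. The point is only that the sensor's own `poll()` is untouched and that the HA check is skipped entirely when HA is disabled.

```python
class ExistingPollingSensor:
    """Stand-in for an unmodified polling sensor; only poll() matters here."""

    def __init__(self):
        self.polls = 0

    def poll(self):
        self.polls += 1  # real sensors would query an external system here


class HAGate:
    """Hypothetical HA add-on holding the status last assigned by st2sa."""

    VALID = ("operational", "standby")

    def __init__(self):
        # Never poll before st2sa has answered with "operational".
        self.status = "standby"

    def update(self, status):
        if status not in self.VALID:
            raise ValueError("unknown status: %r" % (status,))
        self.status = status

    def may_poll(self):
        return self.status == "operational"


def run_poll_cycle(sensor, gate=None):
    """Run one poll cycle and report whether the sensor actually polled.

    gate=None means HA is disabled, so the sensor behaves exactly as before.
    """
    if gate is None or gate.may_poll():
        sensor.poll()
        return True
    return False
```

With this shape, the non-HA path is a single `gate is None` check, which keeps the regression surface small.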

4 Lock v/s No Lock Tradeoff

The problem can be solved either with or without locking. A locking scheme requires individual polling sensors to acquire locks with some kind of lease timeout. Once the lease times out, the lock has to be reacquired. This scheme is not recommended for the following reasons:

Each sensor has to perform locking, which means more logic in sensors. This “spreads” the logic across 100s or maybe 1000s of sensors. It is easier if this logic lives in a central place like st2sa
Locking is inherently difficult to debug in timing situations that happen “at scale”, invariably in larger deployments. I have personally seen several such locking issues, which we could never reproduce in the lab. These issues only happen in the field, and take long to debug
A locking scheme is difficult to scale

Therefore locking is not advisable. When most of the critical logic is in one place (st2sa) it is easier to trace logs and debug issues. In addition, st2sa can be enhanced with logic to sweep through all sensors, ensuring their health. Also, in case the Controller Box dies and some other box takes over, the heartbeat mechanism allows sensors to register with the new st2sa, and correct polling operation would resume within a short period of time (although the new st2sa may elect a different sensor than the previous one).
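For contrast, the lease-based locking scheme that this section argues against could look roughly like the sketch below. It is a single-process illustration with a hypothetical `LeaseLock` class; in a real deployment the lock would live in a shared backend, and every sensor would need its own acquire and renew loop, which is the spreading of logic described above.

```python
import time


class LeaseLock:
    """Hypothetical in-memory lease lock, for illustration only."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, sensor_id):
        """Acquire or renew the lease.

        Returns True if sensor_id holds the lease after the call. The current
        holder must call this again before the lease expires, otherwise any
        other sensor can take over.
        """
        now = self.clock()
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == sensor_id:
            self.holder = sensor_id
            self.expires_at = now + self.lease_seconds
            return True
        return False
```

Each sensor would poll only while `acquire()` keeps returning True, so a missed renewal on a slow or paused sensor silently hands the lock to another one.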

5 Timing Diagrams

5.1 Diagram 1

SensorHA

Explanation of state and status

“state”: maintained internally by the sensor. It cannot be programmed externally, but it can be read by other entities.

“status”: the programmable element of the sensor. It can be programmed by the Arbitrator.

Sensor Registration & Response

As a sensor comes up, it detects HA mode
The sensor sends a Register Sensor message to the Arbitrator
Once it receives Success, it starts sending Keep Alive messages periodically
Each Keep Alive Response contains the sensor's “status” information. The sensor compares the requested status with its current state and acts accordingly

Asynchronous Status Change by Arbitrator

The Arbitrator can decide to send an asynchronous status change message at any time, without waiting for a Keep Alive
This message sends the new status information to the sensor
The sensor acts on it and changes its state if need be
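The message flow above can be sketched from the sensor's side as follows. The message classes, the `State` values (`UNREGISTERED`, `IDLE`, `POLLING`), and the method names are assumptions made for illustration, since the timing diagram itself is not reproduced in the text. Only the split between an internally maintained “state” and an Arbitrator-assigned “status” comes from the proposal.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Programmed by the Arbitrator."""
    OPERATIONAL = "operational"
    STANDBY = "standby"


class State(Enum):
    """Maintained internally by the sensor (names are assumed)."""
    UNREGISTERED = "unregistered"
    IDLE = "idle"
    POLLING = "polling"


@dataclass
class KeepAliveResponse:
    sensor_id: str
    status: Status


@dataclass
class StatusChange:
    """Asynchronous message; may arrive at any time between keep-alives."""
    sensor_id: str
    status: Status


class SensorStateMachine:
    def __init__(self, sensor_id):
        self.sensor_id = sensor_id
        self.state = State.UNREGISTERED

    def on_register_success(self):
        # Registered, but not allowed to poll until told "operational".
        self.state = State.IDLE

    def on_status(self, message):
        """Handle a keep-alive response or an asynchronous status change."""
        if self.state is State.UNREGISTERED:
            return self.state  # ignore status until registration succeeds
        wanted = State.POLLING if message.status is Status.OPERATIONAL else State.IDLE
        if wanted is not self.state:
            self.state = wanted  # start or stop polling as requested
        return self.state
```

Both message types go through the same handler, so an asynchronous status change needs no extra logic in the sensor beyond receiving the message.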

Polling Sensor HA Design Document.docx

arm4b commented 4 years ago

@vivekdatar Thanks for creating the proposal discussion with design!

To get others' attention on this, can you please adjust your first message and include all the content/diagrams as formatted text instead of pointing to a document to download?

See https://github.com/StackStorm/discussions/issues/14 and other Issues https://github.com/StackStorm/discussions/issues for example.

vivekdatar commented 4 years ago

Thanks @armab. Appreciate your help.

arm4b commented 4 years ago

@vivekdatar Thanks for putting such a well-organized document together for the Sensors HA. It's been an unresolved problem for a long time.

A few questions:

1) If we introduce a new sensor arbitrator service into the StackStorm architecture (st2sensorcontroller) that tracks the child sensors' statuses and liveness, how do we scale out these st2sensorcontrollers if there is such a need? Or is it expected to run only a single sensorcontainer? In HA mode all st2 components can be scaled out, see https://docs.stackstorm.com/reference/ha.html. I'm wondering what the mechanism might look like in this case.

2) Besides that, thanks for touching on locking. Many st2 components already rely on the coordination backend for distributed locking and other HA capabilities. See st2.conf: https://github.com/StackStorm/st2/blob/master/conf/st2.conf.sample#L103-L109

[coordination]
# Endpoint for the coordination server.
url = None
# True to register StackStorm services in a service registry.
service_registry = False
# TTL for the lock if backend suports it.
lock_timeout = 60

The interesting part here is service_registry. Looks like we already implemented some useful primitives for that in the past.

See PR: Register services in service registry during the service bootstrap phase #4548, which adds coordinator heartbeats that detect whether group members (services) are still members of the group.

This means the heartbeats and the list of alive members/services described in the original proposal are already part of the st2 core. They rely on the current [coordination] functionality via the tooz (distributed system helper) library https://docs.openstack.org/tooz/latest/ and need a backend like Redis, Zookeeper, etcd, etc., as described in https://docs.stackstorm.com/reference/ha.html#zookeeper-redis. Check the PR above; it contains a lot of information and a description that might help us.

With the Group Membership capabilities in place, is Leader Election (https://docs.openstack.org/tooz/1.57.1/tutorial/leader_election.html) the relevant next topic for sensors HA? I'd like us to explore whether we can leverage the existing, already built primitives and tooz functionality for consistency.

@Kami any insight from you on this topic as someone who implemented this and @m4dcoder @nmaludy who reviewed it?

m4dcoder commented 4 years ago

@vivekdatar Thank you for the proposal. This is well thought out. We have an outstanding plan to introduce service discovery, specifically to track action runners (and other components). The action runner instance will register with service discovery on boot-up and perform regular health checks during runtime. Some of this is already implemented, per @armab above. Is it possible to revisit the design for the polling sensor here so that it shares the same service registry, and to identify the gaps in the current implementation? Thanks again for your contribution.
