StackStorm / community

Async conversation about ideas, planning, roadmap, issues, RFCs, etc around StackStorm
https://stackstorm.com/
Apache License 2.0

HA - Banter #12

Open lakshmi-kannan opened 6 years ago

lakshmi-kannan commented 6 years ago

I think we have to quickly make the following decisions:

  1. Do we really want to implement sensor failover? https://docs.stackstorm.com/reference/sensor_partitioning.html

In all modes other than dynamic mode, the user defines precisely where each sensor should run. The file mode is actually sweet because an st2 admin defines the entire topology of nodes and exactly which sensors each should run. If monitoring matches the deployment, we can tell users to rely on monitoring, and failover would happen manually through user intervention, though the time to recover is a problem there. The alternative is to implement a true HA story where replicas of sensor nodes are configured to run the same sensors (in dynamic mode, this is based on a hash rather than explicit user configuration; a rough sketch of the idea follows this list). In that case, we would need to implement leader election in which all sensor nodes participate to elect a leader for each hash range. A network partition could still leave two nodes running the exact same sensors. This would result in duplicate events, and unless policies are carefully designed, we could end up with duplicate actions or workflows. Maybe we can add host info to every trigger instance emitted and do some kind of windowing to detect duplication, but this all requires us to build:

With true HA, the mean time to recover in a node-loss situation would be sub-second. Given that not a lot of our users actually write custom sensors, I wonder whether this is something we should solve with so much engineering effort.

  2. Do we want to support timers in HA mode? Similar to sensor partitioning, timers would also need to be partitioned, so in terms of HA reliability we'd need exactly the same pieces we need for sensors here too. There are a lot of distributed cron systems out there; maybe we can provide hooks so they can post a webhook to st2? Building a distributed cron inside st2 is an interesting idea but also adds complexity.
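For illustration, here is a rough sketch of what the hash-based (dynamic mode) assignment boils down to. The node names, ranges, and hash function choice are made up and are not st2's actual implementation:

```python
import hashlib

# Hypothetical hash-range partitioning: each node owns a slice of a 32-bit
# hash space, and a sensor runs on whichever node owns the hash of its
# reference. Names and ranges below are illustrative only.
HASH_SPACE = 2 ** 32


def sensor_hash(sensor_ref):
    # Stable hash of a sensor reference such as "linux.FileWatchSensor".
    digest = hashlib.md5(sensor_ref.encode("utf-8")).hexdigest()
    return int(digest, 16) % HASH_SPACE


def owning_node(sensor_ref, node_ranges):
    # node_ranges maps node name -> (start, end) half-open hash range.
    value = sensor_hash(sensor_ref)
    for node, (start, end) in node_ranges.items():
        if start <= value < end:
            return node
    raise ValueError("hash value %d is not covered by any range" % value)


if __name__ == "__main__":
    ranges = {
        "sensor-node-1": (0, HASH_SPACE // 2),
        "sensor-node-2": (HASH_SPACE // 2, HASH_SPACE),
    }
    print(owning_node("linux.FileWatchSensor", ranges))
```

The duplicate-event concern above comes in when a network partition makes two nodes both believe they own the same range.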
m4dcoder commented 6 years ago

@lakshmi-kannan We plan to implement a delay/sleep command in orchestra. When that is called, the execution is put on hold until the delay/sleep expires. My thought is to redesign the action execution scheduler because we also need to revisit priority execution. The redesign will need a timer of sorts to handle delay/sleep commands. Bringing this up because it may be relevant to the timer HA discussion here.

lakshmi-kannan commented 6 years ago

@Kami and I had a discussion and we agreed on the following internally:

m4dcoder commented 6 years ago

@lakshmi-kannan @Kami For the orchestra implementation, I had to add related logic to st2notifier. My plan was to rename it to something like st2actioncontroller. If you're splitting it to st2policymanager and/or st2scheduler, please keep my changes in mind.

arm4b commented 6 years ago

Sensors will be partitioned explicitly by the user using the file partitioner (https://docs.stackstorm.com/reference/sensor_partitioning.html#file), which lists the nodes and the sensors each node should run. This has several advantages over the dynamic hash-range-based partitioner: it is an easier model to grasp and operate, and the operator knows precisely where a sensor is running, which makes debugging easier.

@lakshmi-kannan @Kami

Is the file partitioner with explicitly listed sensors a hard requirement, or can the other strategies (https://docs.stackstorm.com/reference/sensor_partitioning.html) still work?

While the file approach might be easier in implementation/code, it's not easier in configuration for the end user once we talk about scaling. Explicitly assigning sensor X to node Y is a big PITA and another configuration burden. Things like dynamic auto-scaling become very hard with this approach. Definitely not HA-friendly or K8s-friendly.

I was thinking about generating st2.conf for each sensor container based on https://docs.stackstorm.com/reference/sensor_partitioning.html#hash. With Helm charts + ranges we can template st2.conf based on the total number of sensorcontainer nodes. This way the user just specifies the number of st2sensorcontainer replicas they need to run; no further configuration required.
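For example, a minimal sketch of that generation step is below. The even split of the hash space is illustrative, and the st2.conf option names and range syntax are assumptions that should be checked against the hash partitioner docs linked above:

```python
# Hypothetical generator: given the number of st2sensorcontainer replicas,
# split a 32-bit hash space into even sub-ranges and emit one st2.conf
# fragment per replica. The option names and the "start|end" range format
# are illustrative, not the documented syntax.
HASH_SPACE = 2 ** 32


def hash_ranges(replica_count):
    step = HASH_SPACE // replica_count
    for i in range(replica_count):
        start = i * step
        end = HASH_SPACE if i == replica_count - 1 else (i + 1) * step
        yield start, end


def render_st2_conf(replica_index, replica_count):
    start, end = list(hash_ranges(replica_count))[replica_index]
    return (
        "[sensorcontainer]\n"
        "sensor_node_name = sensor-node-%d\n"
        "partition_provider = name:hash, hash_ranges:%d|%d\n"
        % (replica_index, start, end)
    )


if __name__ == "__main__":
    # In a Helm chart this loop would be a template over the replica count.
    for i in range(3):
        print(render_st2_conf(i, 3))
```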

Thus, while we make something easier for us, it could be harder for our users. Any serious drawbacks with https://docs.stackstorm.com/reference/sensor_partitioning.html#hash?

Kami commented 6 years ago

@m4dcoder

@lakshmi-kannan @Kami For the orchestra implementation, I had to add related logic to st2notifier. My plan was to rename it to something like st2actioncontroller. If you're splitting it to st2policymanager and/or st2scheduler, please keep my changes in mind.

That's good to hear and it sounds like we actually think alike. I think we can probably call it st2scheduler; that way it stays a single service instead of splitting it into two (since the policy-related stuff is also basically (re)scheduling).

@armab

We talked about this, and the most serious drawback is the complexity of the implementation (and maintenance) and of troubleshooting / debugging it (it reduces visibility for the operator because partitioning is dynamic and it's not always immediately clear which node a particular sensor is running on).

Dynamic hash-based partitioning would also require us to implement dynamic failover and re-balancing. Failover itself is not that big of a deal and can be solved with leader election, but dynamic rebalancing on failover is complex and requires implementing a gossip protocol, which is not trivial.

Instead of going with static assignment, we could still go with hash-based assignment and simply do the failover, but not the rebalancing (e.g. on failover of node A, which handles range Y..X, a node picked via the leader election process would take over those sensors).

This could potentially cause uneven load on the nodes, but I think it's a good compromise which keeps complexity down (it took Cassandra and other projects many years and many, many person-hours to get this right and stable / robust, and there is no way we can do it in a robust manner in a short time frame. I'm sure we could do "something", but it wouldn't be robust. And I'd rather not support something at all than support it with many limitations and edge cases, especially in an HA scenario).
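To make that compromise concrete, a rough sketch of "failover without rebalancing" on top of tooz group membership might look like the following. The group name, member IDs, and the takeover policy are hypothetical:

```python
import time

from tooz import coordination

# Hypothetical sketch: every sensor node joins a group. When a member
# leaves (node loss), whichever survivor wins the takeover lock adopts the
# departed node's hash range as-is; there is no global rebalancing.
GROUP = b"st2-sensor-nodes"


def take_over_range(member_id):
    # Placeholder: look up the departed member's hash range and start
    # running its sensors locally. Not implemented here.
    print("taking over hash range previously owned by %s" % member_id)


def on_member_left(event, coordinator):
    # Only one survivor should win this lock, so the takeover happens once.
    lock = coordinator.get_lock(b"takeover-" + event.member_id)
    if lock.acquire(blocking=False):
        try:
            take_over_range(event.member_id)
        finally:
            lock.release()


def main(member_id):
    coordinator = coordination.get_coordinator("etcd3://127.0.0.1:2379", member_id)
    coordinator.start(start_heart=True)
    try:
        coordinator.create_group(GROUP).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(GROUP).get()
    coordinator.watch_leave_group(GROUP, lambda event: on_member_left(event, coordinator))
    while True:
        coordinator.run_watchers()  # invokes on_member_left on node loss
        time.sleep(1)


if __name__ == "__main__":
    main(b"sensor-node-1")
```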

lakshmi-kannan commented 6 years ago

Instead of going with static assignment, we could still go with hash-based assignment and simply do the failover, but not the rebalancing (e.g. on failover of node A, which handles range Y..X, a node picked via the leader election process would take over those sensors).

I think this is a reasonable trade-off option that we did discuss. I am +1 on this personally. Complete rebalancing would add too much complexity to get right.

lakshmi-kannan commented 6 years ago

Okay, I finally have a sense of what technology to use to build our HA story. We need a backend that gives us 3 things

  1. Locking
  2. Leader election
  3. Service Discovery

Both etcd and zk will work as a backend. If you want an etcd vs zk vs consul comparison, see https://medium.com/@Imesha94/apache-curator-vs-etcd3-9c1362600b26 and https://coreos.com/blog/performance-of-etcd.html. This comment is the most useful: https://news.ycombinator.com/item?id=6367070. (Both @Kami and I worked with Russell and Philips; they are both good engineers.)

I personally favor etcd because it's more relevant in the k8s world than zk. I am pretty sure our scaling needs for locking, leader election and service discovery aren't anywhere close to stressing zk or etcd out.

I have operated a ZK cluster before and it's a freaking nightmare. I have no experience operating etcd, but I understand optimistic locking on top of a KV system and I definitely understand how Raft works. Therefore, I am leaning towards etcd here too.

Next comes the question of client libraries. Here kazoo (the ZK Python client library, https://github.com/python-zk/kazoo) is more mature than etcd's (https://github.com/kragniz/python-etcd3). But the etcd client library is fairly small thanks to good APIs and RPC exposure. ZK has a more layered approach, with curator recipes on top of ZK primitives, and therefore the client library is just hard to understand.
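To give a feel for how small that surface is, here is a minimal sketch using python-etcd3; the host, key names, and TTLs are made up:

```python
import etcd3

# Minimal sketch of the python-etcd3 API: a distributed lock plus a leased
# key (the lease expires if the holder stops refreshing it, which is the
# usual building block for liveness / leader election).
client = etcd3.client(host="127.0.0.1", port=2379)

# Distributed lock, released automatically when the block exits.
with client.lock("st2-coordination-demo", ttl=10):
    print("holding the lock, doing coordinated work")

# Leased key: disappears ~15 seconds after the owner stops heartbeating.
lease = client.lease(15)
client.put("/st2/members/sensor-node-1", "alive", lease=lease)
print(client.get("/st2/members/sensor-node-1"))
```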

Now what do we pick?

Things to consider for us

tooz supports leader election for zk but it doesn't support it for etcd. See https://docs.openstack.org/tooz/1.57.1/tutorial/leader_election.html https://github.com/openstack/tooz/blob/master/tooz/drivers/etcd3.py#L334

It's not too hard to implement the missing functionality in tooz (For example, we could copy code from https://gist.github.com/tfrench/3ee209f1d533dc73a7e63298177a02d9) but contributing to openstack is a freaking pain.

We could switch completely to the etcd3 client library and ignore tooz, but this would mean we won't be flexible with the choice of backend, and 3.0 may not be deployment-compatible with < 3.0. I don't see this as a pain; the number of HA deployments right now is small.

My proposal:

  1. Declare etcd3 as the backend for all coordination stuff and discovery for st2 HA
  2. Just get rid of tooz and move to etcd3 client library.
warrenvw commented 6 years ago

I'm also in favor of etcd over zookeeper, in part because I've got no experience with zookeeper. I have a little experience running an etcd cluster, and Raft is something I've studied too. I like how etcd scales better, but as you mention, st2's scale needs would likely not stress out either zookeeper or etcd. When it comes to usability, reading that operating a zookeeper cluster is a "freaking nightmare" only adds to my gut feeling that etcd is best. Your proposal sounds good to me.

arm4b commented 6 years ago

Instead of going with static assignment, we could still go with hash-based assignment and simply do the failover, but not the rebalancing (e.g. on failover of node A, which handles range Y..X, a node picked via the leader election process would take over those sensors).

I think this is a reasonable trade-off option that we did discuss. I am +1 on this personally. Complete rebalancing would add too much complexity to get right.

For me this trade-off is like choosing between cutting bread with a fork or with a spoon (even though, in practice, it's a bit easier with a fork). I mean I still wouldn't be able to compose it into a good stackstorm-ha Helm app, because of the requirement to list sensors explicitly on the user's side.

I'd say if both proposals are equally bad for a K8s environment, just use the easier one and avoid the complex things you mentioned; we can always redo the solution if new requirements arise (a startup MVP, so to say). It will put more configuration burden on the user, but OK.

arm4b commented 6 years ago

@lakshmi-kannan @Kami Continuing this sensor partitioning topic: to fit the file-based strategy you proposed before, and talking about the K8s "kosher way" (cc @dzimine) of HA, the good practice would be to add an HA-focused option that binds only 1 sensor per service and exits st2sensorcontainer if that specific sensor fails.

Proposal

At the moment, many sensors are run in the background, forked from the main service. The proposal is to have a hard mode (st2.conf) to run only 1 sensor, treated as a service: the sensor fails, the service dies, the container is stopped.

Real problem example:

Before, if one of the sensors failed or was killed, we needed to restart the entire st2sensorcontainer service to revive that single failed sensor in the list, and TBH that was a really painful part of running StackStorm in a near-prod environment, based on #ops-infra and #opstown experience.

Follow the K8s best practices:

This one enhancement brings a lot of goodness:

So if we can bring a 1 sensor = 1 container mode to simple static partitioning, I'd vote for it, since this way one story complements and enhances another with a consistent, valuable end result, giving enough advantages over the disadvantages from the user/operations/k8s perspective.

It may also simplify your work by offloading part of the responsibility to the K8s cluster, which already does real self-healing well. That's the heart of Kubernetes: wrap a simple app and make it HA/reliable/resilient with K8s powers.


By mentioning the "kosher way", @dzimine's favorite :), here ^^ is the promised (still minimal) list of why 1-process-per-container makes so much sense in K8s, a topic raised several times before. So much useful goodness comes from this one concept.

warrenvw commented 6 years ago

I'm 100 x 👍 on what you just mentioned regarding following k8s best practices... cuz the more 12FA we make ST2 (HA), the happier 😀 I'll be.

Kami commented 6 years ago

@lakshmi-kannan and others

Re locking and election - it looks like there actually is an etcd tooz driver which supports leader election (https://github.com/openstack/tooz/blob/master/tooz/drivers/etcd3gw.py).

So knowing that, I would prefer us to go with that approach first. I plan to test it out today to see if it's indeed working and report back.

EDIT: It actually doesn't implement all that is needed for leader election (it's missing watch_elected_as_leader but most other group methods are implemented).

@armab

As far as sensor partitioning goes - I think 1 sensor per pod / container is a great idea and follows the best practices :+1:

This would simplify our life a lot since we would outsource some of the failover to Kubernetes :)

At the moment, many sensors are run in the background, forked from the main service. The proposal is to have a hard mode (st2.conf) to run only 1 sensor, treated as a service: the sensor fails, the service dies, the container is stopped.

We already have most of that code in place; we would just need a simple, light wrapper service around our sensor wrapper script (in fact, that's how I usually test a single sensor locally).
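For illustration, such a light wrapper could be as small as the sketch below. The sensor_wrapper module path and CLI flags used here are assumptions for illustration, not a confirmed interface:

```python
import subprocess
import sys

# Hypothetical "run exactly one sensor and die with it" wrapper. The module
# path and flags are assumptions; the real single-sensor wrapper lives in
# st2reactor and may take different arguments.
def run_single_sensor(pack, file_path, class_name):
    cmd = [
        sys.executable, "-m", "st2reactor.container.sensor_wrapper",
        "--pack", pack,
        "--file-path", file_path,
        "--class-name", class_name,
    ]
    # Block until the sensor process exits, then propagate its exit code so
    # the container dies and Kubernetes restarts it.
    return subprocess.call(cmd)


if __name__ == "__main__":
    sys.exit(run_single_sensor(
        pack="linux",
        file_path="/opt/stackstorm/packs/linux/sensors/file_watch_sensor.py",
        class_name="FileWatchSensor",
    ))
```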

Kami commented 6 years ago

As far as etcd support in tooz goes - as mentioned above, etcd3gw driver implements most of the group joining and watching methods so implementing support for leader election is quite straightforward - https://github.com/openstack/tooz/compare/master...Kami:etcd3_gw_implement_leader_election?expand=1

I did some basic testing and it seems to work fine (which should indeed be the case, since leader election stuff is mostly based around a lock primitive + heartbeating and the lock implementation should be solid).
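For reference, once the driver supports it, leader election through tooz roughly looks like the sketch below. The etcd3+http backend URL and the group / member names are assumptions:

```python
import time

from tooz import coordination

# Rough usage sketch of tooz leader election. The backend URL assumes the
# etcd3gw ("etcd3+http") driver with the leader election support from the
# branch linked above; group and member names are made up.
GROUP = b"st2-timers"


def on_elected(event):
    print("%s is now the leader of %s" % (event.member_id, event.group_id))


def main(member_id):
    coordinator = coordination.get_coordinator("etcd3+http://127.0.0.1:2379", member_id)
    coordinator.start(start_heart=True)
    try:
        coordinator.create_group(GROUP).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(GROUP).get()
    coordinator.watch_elected_as_leader(GROUP, on_elected)
    while True:
        coordinator.run_watchers()  # fires on_elected when this member wins
        time.sleep(1)


if __name__ == "__main__":
    main(b"st2timer-1")
```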

The only "downside" of the grpc gateway driver is that the grpc API is in beta and only available in recent releases (I think that should be fine since this API has been there for a while already).

lakshmi-kannan commented 6 years ago

@Kami etcd3gw is fundamentally no different from what I linked with etcd3; gw is supposed to use the JSON-gRPC gateway. I think you should make the changes to etcd3, which also has the partial membership methods implemented. Just FYI.

@armab With the 1 container per sensor idea, like Kami said, we'll push the HA story to kubernetes, but unfortunately for us, we have to support non-k8s HA deployments too and this failover story wouldn't work there. This is why we thought about HA in isolation from the delivery mechanism. So I guess we have to make a call on whether we want to support HA only in k8s. For me, this is not possible because of our customers, but let's check with @LindsayHill and @dzimine.

Kami commented 6 years ago

Since I already started on gw backend, I will try to finish that and get it merged - https://review.openstack.org/#/c/575383/3

And later on, if needed, I can make the same / similar changes to the non-gw driver.

arm4b commented 6 years ago

With the 1 container per sensor idea, like Kami said, we'll push the HA story to kubernetes, but unfortunately for us, we have to support non-k8s HA deployments too and this failover story wouldn't work there. This is why we thought about HA in isolation from the delivery mechanism. So I guess we have to make a call on whether we want to support HA only in k8s.

@lakshmi-kannan With this approach we at least make it usable in Kubernetes and can try HA faster at the smallest implementation price. It doesn't bring any regression to existing methods like VMs/bare-metal; they still work as before via https://docs.stackstorm.com/reference/ha.html

You guys told us to iterate faster. We don't even know the real problems of running StackStorm in HA mode, we never tried it (sic!), but we wrote that HA doc. I think having at least 1 fully codified and working StackStorm HA solution for Enterprises earlier follows the idea you guys mentioned the other day. It also follows your idea to do less work.

We can still improve the HA story for non-K8s with enhanced failover later, once we have real experience running StackStorm in K8s ourselves. But here & now, for the initial K8s work, we definitely need 1 service = 1 container, and StackStorm's built-in failover is excessive because it's offloaded to K8s. That's the main point of K8s, and users in that environment want to control the behavior from the K8s side, which is more flexible/smart/reliable.

lakshmi-kannan commented 6 years ago

I think now I understand what you mean. I agree with you that this is a faster approach and it doesn't take anything away from us doing failover in the non-k8s world later. So with this in mind, I think I agree with your 1-sensor-per-container approach. It is the greatest idea we have heard. So great job!

If we split the timers into their own process, can we then do a similar thing for timers in k8s? Unfortunately, we won't be able to claim any HA in the non-container world without leader election, but with the tooz changes Kami is making, we can document our approach for non-k8s deployments.

dzimine commented 6 years ago

So I guess we have to make a call on whether we want to only support HA in k8s. For me, this is not possible because of our customers but let's check with @LindsayHill and @dzimine

I am OK with it. I agree with the direction in general - let's have A (one) HA deployment, whichever reasonable way out of the near-infinite many. Put stress on it, get the numbers, post the results in a good article (aka current state). Regroup and decide 1) what to fix and 2) how to change the approach. Yes, it won't cover some customers' cases, but there's no one-size-fits-all HA. Betting on K8s is right. Di Data will be upset, but we shall not/rarely/almost-never let one big user skew the product direction.

If this ^^ takes a month, it's good. If it takes 3 months, let's find a different way that delivers some incremental results in a month.

PS.

we never tried it (sic!), but we wrote that HA doc. I think having earlier at least 1 fully codified and working StackStorm HA solution

For the record, and in defense of Manas' good name: Manas had set up the HA cluster and tried everything he wrote in the HA doc. It didn't take him forever, and we have been benefiting from it for over 2 years now (as opposed to not having anything for HA).

LindsayHill commented 6 years ago

I believe that k8s is right for our "reference, codified" HA implementation, and that we are right to use k8s capabilities where it makes sense here.

That said, I don't think it means we only support HA on k8s. People can still run HA using their own setup, but we will give them reference docs, not a codified implementation. They will have to make more decisions, possibly implement some local workarounds, etc. I don't think anything we're planning precludes that, does it?

lakshmi-kannan commented 6 years ago

I don't think anything we're planning precludes that, does it?

It does. Since we are going to rely on k8s to handle failover, non-container deployments have to think about this, and it may not be a simple workaround. But if it comes to some big customer wanting to deploy on VMs, we can take this challenge on, and the work we do on st2 code won't block us from doing something then (I hope). I haven't talked to @Kami about the sensor container changes he wants to make.

LindsayHill commented 6 years ago

My read of the existing HA documentation is that we say "here's how you can partition sensors", but it doesn't seem to offer any automated failover capabilities. We leave that up to the reader to re-partition sensors if a sensor container dies.

Under the proposed model, will users not be able to partition sensors? Or will they still be able to do things manually?

lakshmi-kannan commented 6 years ago

Under the proposed model, will users not be able to partition sensors? Or will they still be able to do things manually?

The proposed model makes no changes to what exists today. It's almost a noop in st2 code, other than some code changes to run a single sensor standalone inside a sensor container. Right now, a single st2sensorcontainer forks a bunch of sensor processes (so st2sensorcontainer is equivalent to k8s: it handles sensor failures and restarts them). This won't be the case going forward. How we orchestrate running multiple sensors in non-container deployments is yet to be determined, but I am hoping @Kami has already thought about this. A sensor per container is mandatory in the k8s world to address failover. The concept of partitioning won't change from today's model (assigning which sensors run where). Sorry for the longer answer, but I think it's important for everyone to think about this more.

m4dcoder commented 6 years ago

We are now saying - bring your backend.

@lakshmi-kannan @Kami Even if we stick w/ tooz, it's more of a benefit/abstraction for us than for our users. Since we are revisiting this for HA, let's be opinionated about which backend the user should use for locking, leader election, etc., because it will be tested and so we know it will work.

Kami commented 6 years ago

Lakshmi asked me for a list of tasks (aka an MVP) we need in order to be able to benchmark StackStorm in HA mode and get some basic numbers:

StackStorm work

Keep in mind that this list is subject to change once we actually start testing / hammering StackStorm (it's likely that new issues we need to address will appear).

Kubernetes / Ops work

(@armab will provide a more detailed and concrete list)

Other

Since we don't want those numbers just once and we don't want to do a one-off, I believe one of the important tasks for this milestone / deadline should also be "continuous benchmarking and tracking of performance" over time.

In the past, I used codespeed for that (https://github.com/tobami/codespeed). It's nothing fancy, but it works fine (it's an API + web ui to which you submit metrics over time / for each commit and the WebUI visualizes it over time).

We should spin up a codespeed instance (ops / terraform task) and work on a benchmarking workflow which uses the Kubernetes work, spins up a new HA Kubernetes cluster on each master merge, and then submits various metrics to the codespeed instance.
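Submitting a data point to codespeed is a single HTTP POST per result; a minimal sketch, where the instance URL, project, executable, and benchmark names are placeholders:

```python
import requests

# Minimal sketch of pushing one benchmark result to a codespeed instance
# (https://github.com/tobami/codespeed). The URL and names are placeholders;
# the form fields follow codespeed's result/add API.
def submit_result(commit_id, benchmark, value):
    payload = {
        "commitid": commit_id,
        "branch": "master",
        "project": "st2",
        "executable": "st2-ha",
        "benchmark": benchmark,          # e.g. "webhooks_per_second"
        "environment": "k8s-ha-cluster",
        "result_value": value,
    }
    response = requests.post("http://codespeed.example.com/result/add/", data=payload)
    response.raise_for_status()


if __name__ == "__main__":
    submit_result("abc1234", "webhooks_per_second", 512.0)
```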

In addition to all the other hard numbers we want for a particular configuration (number of webhooks per time unit we can handle, number of Python runner actions, number of rules processed, etc.), we should also track these basic metrics:

This should also allow us to more easily and automatically catch performance regressions.

On a related note - we should also come up with a list of all the metrics we want / need (I think lakshmi / dmitri already posted something on Slack, but we should also recap it here somewhere).

arm4b commented 6 years ago

From https://stackstorm.slack.com/archives/C4CAXVA05/p1529508232000025, here is a minimal list of tasks (and blockers) to get quick stats, from the K8s team perspective (ordered by priority):

Note that this stage is slightly different from the "prod/ship" stage; the main point is to get some stress-testing numbers quickly as we hack on the K8s cluster live, without worrying about how that would be shipped to users or whether it is codified and prod-ready.

1) ~Single-sensor mode (https://github.com/StackStorm/st2/pull/4179)~ (done in st2)
1.1) We need to codify/config that in K8s/Docker (k8s)
2) st2 core script to run pack download + virtualenv: https://github.com/StackStorm/k8s-st2/issues/21 (st2)
3) Any etcd/timers changes in st2 core for rules/coordination, @lakshmi-kannan @kami know better (st2)
4) Write the stress-tests scenarios and communicate clearly what configuration is needed from K8s side to run that (st2)
4.1) It will take some time for K8s team to adopt these tests + configuration for existing setup (K8s)
5) Quick hack live K8s cluster to setup graphing and show CPU + Memory pod utilization, during stress-tests to know the bottlenecks (K8s)
6) st2 pack install to throw an error in K8s env to block users trying to break their st2 setups https://github.com/StackStorm/k8s-st2/issues/20 (st2)
6.1) Changes on K8s to support that (K8s)
7) Setup MongoDB, RabbitMQ, PostgreSQL and Mistral in cluster/HA mode (K8s)
7.1) Big task, will postpone that and run dep. services in non-HA mode for the first stage, as agreed with @warrenvw.
8) We'll need /health endpoints https://github.com/StackStorm/st2/issues/4020 shipped in 2.9 for sure (st2)

I'd say the quickly hacked K8s cluster on AWS that @warrenvw set up is working more or less, enough to get something running on top of it. In general, @warrenvw did our quick-hack part, and now the turn is on st2-side changes like etcd and timers, which are unknown to us.

Considering that we can't try the st2 changes yet, because we don't want to merge breaking changes/features into master before 2.9dev, I'd like to propose that we start codifying the stress-testing scenarios before the etcd/timers implementation and split the TEST scenarios into 2 stages:
1) Stress-test actions, runners, workflows, sensors (stress-test what we have; we can do it now)
2) Stress-test everything once more HA improvements are implemented in st2: actions, runners, workflows, sensors, rules, policies (stress-test everything, for real)

Even without the timers and etcd st2 core changes, we can test actions + workflows and sensors. Having this part codified will help the K8s team move forward because it will uncover requirements we don't know yet. Having preliminary stress-tests running (even without rules) will help us understand the configuration/bottleneck/env issues we don't know yet in Docker/K8s.

Knowing this ^^ will keep the K8s team busy with the configuration to run those preliminary stress-tests on the HA cluster, until the st2 team is ready with the next iteration: etcd, timers, rules and more advanced stress-tests.

arm4b commented 4 years ago

This Discussion issue was moved from the private Discussions to make it available to public community.

While we did the very first iteration to make Sensors work in a K8s environment with failover, more work is still needed to make the Sensors HA. This discussion lists some ideas around that.

Related to #11