StackStorm / community

Async conversation about ideas, planning, roadmap, issues, RFCs, etc around StackStorm
https://stackstorm.com/
Apache License 2.0

StackStorm HA, scaling, k8s #16

Open lakshmi-kannan opened 6 years ago

lakshmi-kannan commented 6 years ago

StackStorm HA enhancements

### st2sensorcontainer

* Improve reliability of st2 sensors by having a native way for a node to take over a sensor if
  an owner fails. (Sensor replicas with partition map makes sense?)
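
One way to picture the "partition map" idea is the sketch below. It is only an illustration under assumptions, not how st2sensorcontainer works today: it uses rendezvous (highest-random-weight) hashing, and the helper names `owner_for` and `partition_map`, the node names, and the sensor refs are all hypothetical.

```python
import hashlib


def owner_for(sensor_ref, live_nodes):
    """Pick the owner of a sensor via rendezvous hashing (an assumed scheme)."""
    if not live_nodes:
        raise ValueError("no live nodes available")

    def weight(node):
        # Deterministic per-(node, sensor) score; the highest score wins.
        digest = hashlib.sha256(f"{node}:{sensor_ref}".encode()).hexdigest()
        return int(digest, 16)

    return max(live_nodes, key=weight)


def partition_map(sensor_refs, live_nodes):
    """Map every sensor to exactly one live node."""
    return {ref: owner_for(ref, live_nodes) for ref in sensor_refs}


# Hypothetical sensors and nodes, for illustration only.
sensors = ["examples.SensorA", "examples.SensorB", "examples.SensorC", "examples.SensorD"]
before = partition_map(sensors, ["node-a", "node-b", "node-c"])
# Simulate node-b failing: recompute the map over the surviving nodes.
after = partition_map(sensors, ["node-a", "node-c"])
```

With this scheme, only the sensors that were owned by the failed node move; every other sensor keeps its owner, so a takeover does not reshuffle the whole cluster. Detecting the failure (heartbeats, a coordination service) is out of scope for the sketch.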

### st2rulesengine

* Currently, the timers run in st2rulesengine and are not HA compatible. We should fix this.
    - An option is to view st2timersengine as a separate process and offload responsibility
    of uptime to kubernetes

### st2actionrunner

* Policies require HA redis/zookeeper for coordination. Verify with redis.

### st2resultstracker

* Verify you can spin up multiple instances with the new callback design for Mistral
* Update documentation https://docs.stackstorm.com/reference/ha.html#st2resultstracker

### st2notifier

* Documentation is confusing - https://docs.stackstorm.com/reference/ha.html#st2resultstracker
* Verify we can spin up more than one st2notifier without Redis/ZK for coordination
    - What happens when we don't have a coordination service?

### Common

* We should do chaos monkey testing by HUPing some processes and seeing how things react

* We should figure out new version deployments - Blue/Green or rolling?
    - I propose blue/green for Kubernetes-based deployments, but this may be hard to do for non-Kubernetes deployments

* We should figure out how to upgrade packs
    - I have no idea how we are going to do this

StackStorm scaling

CustomerX wants a peak of 1200 concurrent automations (we don't know if they mean individual actions or workflows). They want to be able to run 20000 automations per day, which is around 14 executions/min. I am pretty sure we can do 14 executions/min, but we definitely won't be able to do 1200 concurrent Mistral workflows. We should test this.
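
As a quick sanity check on those numbers, here is the arithmetic as a small sketch. The daily rate comes straight from the figures above; the implied duration is an extra assumption (Little's law applied as if the peak concurrency were sustained at the average arrival rate), which the customer request does not actually state.

```python
MINUTES_PER_DAY = 24 * 60  # 1440

daily_target = 20_000      # automations per day (from the request)
peak_concurrency = 1_200   # concurrent automations at peak (from the request)

# Average arrival rate if the load were spread evenly over the day.
avg_rate_per_min = daily_target / MINUTES_PER_DAY

# Little's law: concurrency = arrival_rate * duration.
# ASSUMPTION: peak concurrency sustained at the average rate; real traffic is likely burstier.
implied_duration_min = peak_concurrency / avg_rate_per_min

print(f"average rate: {avg_rate_per_min:.1f} executions/min")
print(f"implied mean duration: {implied_duration_min:.1f} min")
```

Under that assumption each automation would have to run for well over an hour on average to reach 1200 in flight, so the peak figure more likely describes bursts or long-running workflows. That is one more reason to clarify what the customer means before testing.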

StackStorm HA in Docker

StackStorm HA Deployment

K8s in various clouds/on-prem

AWS

- When EKS (Hosted Kubernetes) is out, I think people would prefer to use that. I hope it is integrated
  with Amazon secret store and Amazon Parameter store.
- If EKS isn't an option, we should look into kops for kubernetes deployment and look at solving secrets and config ourselves
- ECS is not an option (Since colocation would be a problem)
- Read https://news.ycombinator.com/item?id=15808065 (Especially comments)

GCP

- GKE (managed Kubernetes, most attractive, some advanced customization not possible)
- kops works with GCP

Azure

- ACS
- No kops https://github.com/kubernetes/kops/issues/3957
- Should we even do this?

On-prem

- Kubespray (uses ansible under the hood - https://kubernetes.io/docs/getting-started-guides/kubespray/)
- We should definitely leave this to community and see if there are any takers for kubespray

arm4b commented 6 years ago

> Right now our image size is 390M - That is way too large for us to pull and deploy.
>
> Should we build a package of st2 that works with apk package manager in alpine linux? Do we need this or can we just git clone the source, install the pip dependencies? We don't need dh-virtualenv because we are now inside Docker? Can we quickly figure out the image size?

390M is a good enough size for a container, especially considering:

vagrant@stackstorm:~$ du -hs /opt/stackstorm/
499M    /opt/stackstorm/

I think Alpine is more of a for-fun thing; it would save ~50 MB or so of the image size.

^^ It feels like there's not much sense in improving the artifact size.

arm4b commented 6 years ago

Overall a lot of useful info here :+1:

Considering there are areas to improve or even re-architect in StackStorm itself for HA, it may be a big/long story to play it well.

LindsayHill commented 6 years ago

> it may be a big/long story to play it well

I see it playing out over multiple releases. Build cookie cutter, iterate, improve, scale test, improve, etc...

Kami commented 6 years ago

I agree with @armab - this is also a great opportunity to try to simplify the architecture and reduce the number of services, where possible.

One thing we talked about in the past when we discussed the new workflow engine was st2resultstracker and st2notifier.

We should get rid of at least one of them and potentially rename / split notifier (scheduler, notifier). If the Mistral callbacks change turns out to work well, there should be very little need for st2resultstracker left (and we would still have a CLI tool which would allow the user to rectify executions, if needed).

blag commented 6 years ago

Conversion progress on the Mistral -> Orquesta migrations across the repositories:

PRs to st2:

PRs to orquestaconvert:

According to the original Slack discussion, the AWS pack also needed to be converted, but I haven't been able to find any Mistral workflows in that. ~Are we intending to migrate the two ActionChain workflows?~ No.

LindsayHill commented 6 years ago

> Are we intending to migrate the two ActionChain workflows?

Don't worry about those right now. Main thing is Mistral workflows. Action Chains can be done later.