StackStorm HA, scaling, k8s

lakshmi-kannan commented 6 years ago

StackStorm HA enhancements

### st2sensorcontainer

* Improve reliability of st2 sensors by having a native way for a node to take over a sensor if
  an owner fails. (Sensor replicas with partition map makes sense?)
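
To make the partition map idea concrete, here is a minimal sketch of one way takeover could work. It is not how st2sensorcontainer behaves today. It assumes something else (for example a coordination service) already tells us which nodes are alive, and the names `owner_for`, `partition_map`, and the node IDs are hypothetical. It uses rendezvous hashing, so when a node dies only the sensors it owned move to another node.

```python
import hashlib


def owner_for(sensor, live_nodes):
    """Pick the owning node for a sensor via rendezvous (highest-random-weight) hashing."""
    if not live_nodes:
        raise ValueError("no live nodes available")

    def weight(node):
        # Each (node, sensor) pair gets a stable pseudo-random weight.
        digest = hashlib.sha256(f"{node}:{sensor}".encode()).hexdigest()
        return int(digest, 16)

    return max(live_nodes, key=weight)


def partition_map(sensors, live_nodes):
    """Build {node: [sensors]} for the nodes that are currently alive."""
    mapping = {node: [] for node in live_nodes}
    for sensor in sensors:
        mapping[owner_for(sensor, live_nodes)].append(sensor)
    return mapping
```

Recomputing `partition_map` with the failed node removed from `live_nodes` reassigns only that node's sensors; everything else stays where it was.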

### st2rulesengine

* Currently, timers run inside st2rulesengine, and that setup is not HA compatible. We should fix this.
    - One option is to run st2timersengine as a separate process and offload responsibility
    for uptime to Kubernetes

### st2actionrunner

* Policies require HA redis/zookeeper for coordination. Verify with redis.

### st2resultstracker

* Verify you can spin up multiple instances with the new callback design for Mistral
* Update documentation https://docs.stackstorm.com/reference/ha.html#st2resultstracker

### st2notifier

* Documentation is confusing - https://docs.stackstorm.com/reference/ha.html#st2resultstracker
* Verify we can spin up more than one st2notifier without Redis/ZK for coordination
    - What happens when we don't have a coordination service?

### Common

* We should do chaos monkey testing by HUPing some processes and seeing how things react

* We should figure out new version deployments - Blue/Green or rolling?
    - I propose blue/green for Kubernetes-based deployments, but this may be hard to do for non-Kubernetes deployments

* We should figure out how to upgrade packs
    - I have no idea how we are going to do this

StackStorm scaling

CustomerX wants a peak of 1200 concurrent automations (we don't know if they mean individual actions or workflows). They want to be able to run 20000 automations per day, which is around 14 executions/min on average. I am pretty sure we can do 14 executions/min, but we definitely won't be able to do 1200 concurrent Mistral workflows. We should test this.
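
For reference, a quick sketch of the arithmetic behind those numbers. The per-minute figure is a plain daily average. The second function applies Little's law (concurrency = arrival rate x average duration), which only holds at steady state, so treat its result as a rough sanity check and not as a requirement from CustomerX. Function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def average_rate_per_minute(executions_per_day):
    """Average executions per minute if load were spread evenly over a day."""
    return executions_per_day / MINUTES_PER_DAY


def implied_duration_minutes(concurrent, executions_per_day):
    """Little's law: average duration needed to sustain `concurrent` at the average rate."""
    return concurrent / average_rate_per_minute(executions_per_day)


print(round(average_rate_per_minute(20000), 1))         # 13.9
print(round(implied_duration_minutes(1200, 20000), 1))  # 86.4
```

So 1200 concurrent at the average rate would mean automations running about 86 minutes each, which suggests the peak is about bursts and not about the daily average.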

StackStorm HA in Docker

Right now our image size is 390M - That is way too large for us to pull and deploy.
- Should we build a package of st2 that works with apk package manager in alpine linux?
  - Do we need this or can we just git clone the source, install the pip dependencies?
  - We don't need dh-virtualenv because we are now inside Docker?
  - Can we quickly figure out the image size?

Helm
- Leave this to community but make sure we have a helm chart https://github.com/StackStorm/st2-docker/pull/126

StackStorm HA Deployment

Ansible playbooks for HA deployment on bare metal/VMs
- Not much to think about here other than building playbooks for one reference OS (Ubuntu 18.04)
  - BTW we should build OS packages for Ubuntu 18.04
- We have to think about secret configuration entries (in st2.conf and packs)
  - At a minimum, we should have a way to deploy st2.conf with secrets

The Kubernetes deployment story in the cloud varies by provider. We should also account for on-prem Kubernetes deployments. We should figure out which cloud providers we want to address directly and for which ones we rely on the community (should we rely on the community?)

Common
- What kind of OS do we want to run on top of?
  - Evaluate container OSes like CoreOS, RancherOS, DC/OS, Project Atomic, Ubuntu Snappy, ...
    - Decide if we really need a container host OS or whether we should just run on Ubuntu (we shouldn't really try to support multiple host OSes)
      - We can use the same OS in both cloud providers and on-prem to control experience and support
      - Automated over-the-air updates
      - Yet another technology https://www.inovex.de/blog/docker-a-comparison-of-minimalistic-operating-systems/
    - I am leaning towards CoreOS because of the ability to spin instances in any cloud and etcd is natively available for service discovery. It would be a hard sell to ask Enterprises to run CoreOS in their data centers. Some probing here would be good.
- How are we going to manage configurations?
  - This is a tricky one
    - People typically use environment variables.
    - People also try to use etcd/consul and use confd inside container https://github.com/kelseyhightower/confd
    - Amazon parameter store
- How are we going to manage secrets?
  - AWS secrets store, vault (see aws-vault), https://kubernetes.io/docs/concepts/configuration/secret/, https://lyft.github.io/confidant/
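
To illustrate the environment-variable approach from the configuration bullets, here is a minimal sketch of a container entrypoint helper that renders INI-style config text from environment variables. The `ST2CONF_<SECTION>__<OPTION>` naming convention and the `render_conf` function are assumptions invented for this example, not an existing st2 feature. Secrets injected as environment variables (for example from a Kubernetes Secret) would flow through the same path.

```python
import configparser
import io
import os

PREFIX = "ST2CONF_"  # hypothetical naming convention, not an st2 feature


def render_conf(environ=os.environ, prefix=PREFIX):
    """Render INI text from variables named <prefix><SECTION>__<OPTION>."""
    # interpolation=None so values containing '%' are written verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    for name, value in sorted(environ.items()):
        if not name.startswith(prefix) or "__" not in name:
            continue  # ignore unrelated variables such as PATH
        section, option = name[len(prefix):].lower().split("__", 1)
        if not parser.has_section(section):
            parser.add_section(section)
        parser.set(section, option, value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

For example, `ST2CONF_DATABASE__HOST=mongo` would end up as `host = mongo` under a `[database]` section. Where the rendered text gets written (and with which file permissions) is left open here.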

K8s in various clouds/on-prem

AWS

- When EKS (Hosted Kubernetes) is out, I think people would prefer to use that. I hope it is integrated
  with Amazon secret store and Amazon Parameter store.
- If EKS isn't an option, we should look into kops for kubernetes deployment and look at solving secrets and config ourselves
- ECS is not an option (Since colocation would be a problem)
- Read https://news.ycombinator.com/item?id=15808065 (Especially comments)

GCP

- GKE (managed Kubernetes, most attractive, some advanced customization not possible)
- kops works with GCP

Azure

- ACS
- No kops https://github.com/kubernetes/kops/issues/3957
- Should we even do this?

On-prem

- Kubespray (uses ansible under the hood - https://kubernetes.io/docs/getting-started-guides/kubespray/)
- We should definitely leave this to community and see if there are any takers for kubespray

arm4b commented 6 years ago

> Right now our image size is 390M - That is way too large for us to pull and deploy.
>
> Should we build a package of st2 that works with apk package manager in alpine linux? Do we need this or can we just git clone the source, install the pip dependencies? We don't need dh-virtualenv because we are now inside Docker? Can we quickly figure out the image size?

390M is a good enough size for a container, especially considering:

vagrant@stackstorm:~$ du -hs /opt/stackstorm/
499M    /opt/stackstorm/

I think Alpine is more of a for-fun thing; it would save only ~50 MB or so of the image size.

^^ It feels like there is not much sense in improving the artifact size.

arm4b commented 6 years ago

Overall a lot of useful info here :+1:

Considering there are areas to improve or even re-architect in StackStorm itself for HA, it may be a big/long story to play it well.

LindsayHill commented 6 years ago

> it may be a big/long story to play it well

I see it playing out over multiple releases. Build cookie cutter, iterate, improve, scale test, improve, etc...

Kami commented 6 years ago

I agree with @armab - this is also a great opportunity to try to simplify the architecture and reduce the number of services, where possible.

One thing we talked about in the past when we discussed the new workflow engine was st2resultstracker and st2notifier.

We should get rid of at least one of them and potentially rename / split notifier (scheduler, notifier). If the Mistral callbacks change turns out to work well, there should be very little need for st2resultstracker left (and we would still have a CLI tool which would allow the user to rectify executions, if needed).

blag commented 6 years ago

Conversion progress for the Mistral -> Orquesta migration across the repositories:

[ ] st2cicd - StackStorm/st2cicd#86 Merged, waiting to deploy again
[ ] st2ci - StackStorm/st2ci#117 WIP
[ ] st2cd - StackStorm/st2cd#347 WIP
[x] AWS pack tag v0.10.1 - Fix run-remote-cmd for old tag, and add a bugfix version to that tag
[x] AWS pack tag v0.10.2 - fix passing user_data to the AWS instance

PRs to st2:

[x] StackStorm/st2#4431 - More thoroughly test ctx(st2) access - worked around.
[x] StackStorm/st2#4435 - Add tests for action_context in action metadata; was fixed in StackStorm/st2#4443.

PRs to orquestaconvert:

[x] EncoreTechnologies/orquestaconvert#4 - macOS support, a few fixups
[x] EncoreTechnologies/orquestaconvert#5 - Mistral translation strings
[x] EncoreTechnologies/orquestaconvert#6 - Add --force option, to mostly convert workflows even if the conversion isn't 100% successful. This makes it easier to convert them by hand.
[x] EncoreTechnologies/orquestaconvert#7 - Add script to convert an entire pack. Implemented via Bash script, but need to convert script to Python, add tests, and report a summary of successful conversions and a roll up of unsuccessful conversions.
[x] EncoreTechnologies/orquestaconvert#9 - Support publishing booleans, integers, and null values.
[x] EncoreTechnologies/orquestaconvert#11 - Convert dashes in task names to underscores
[x] EncoreTechnologies/orquestaconvert#12 - Make target to report test coverage
[x] EncoreTechnologies/orquestaconvert#13 - Add validate and verbose flags to aid manual conversion.
[x] EncoreTechnologies/orquestaconvert#16 - Convert Mistral built-in tasks
[x] EncoreTechnologies/orquestaconvert#17 - Add support for converting with-items

According to the original Slack discussion, the AWS pack also needed to be converted, but I haven't been able to find any Mistral workflows in that. ~Are we intending to migrate the two ActionChain workflows?~ No.

LindsayHill commented 6 years ago

> Are we intending to migrate the two ActionChain workflows?

Don't worry about those right now. Main thing is Mistral workflows. Action Chains can be done later.
