StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more, for DevOps and SREs. It includes a rules engine, workflows, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org), and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

st2 performance issues #6003

Open chain312 opened 11 months ago

chain312 commented 11 months ago

SUMMARY

I used Kubernetes to deploy st2, and now every workflow action runs slowly. I tried increasing the number of replicas to speed up execution, but it doesn't seem to help.

STACKSTORM VERSION

st2 3.8.0, on Python 3.8.10

OS, environment, install method

Kubernetes

Steps to reproduce the problem

The number of pods for each of my microservices in K8s is as follows:

  1. st2sensor: 4 pods. Only the Kafka trigger is in use; the Kafka topic has 4 partitions, so I used 4 sensor pods.
  2. st2actionrunner: 30 pods
  3. st2workflowengine: 30 pods
  4. st2rulesengine: 30 pods
  5. st2scheduler: 20 pods
  6. st2notifier: 20 pods
  7. st2garbagecollector: 1 pod, with the following garbage collection settings:

         [garbagecollector]
         action_executions_ttl = 3
         action_executions_output_ttl = 3
         trigger_instances_ttl = 3
         traces_ttl = 3
         rule_enforcements_ttl = 3
         workflow_executions_ttl = 3
         task_executions_ttl = 3
         tokens_ttl = 3

  8. st2client: 1 pod
  9. st2auth: 2 pods
  10. st2api: 2 pods
  11. st2stream: 2 pods
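For reference, replica counts like the ones above are usually set declaratively, which makes it cheap to experiment with smaller pod counts. A minimal sketch of a Helm values override, assuming a stackstorm-ha style chart; the exact key names here are an assumption, so verify them against your chart's values.yaml:

```yaml
# Hypothetical values override for a stackstorm-ha style Helm chart.
# Key names are illustrative -- confirm against the chart's values.yaml.
st2actionrunner:
  replicas: 30
st2workflowengine:
  replicas: 30
st2rulesengine:
  replicas: 30
st2scheduler:
  replicas: 20
st2notifier:
  replicas: 20
```

Lowering any of these is a one-line change followed by a `helm upgrade`, so trying smaller counts per component is inexpensive.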

Expected Results

(monitoring screenshots attached in the original issue)

Actual Results

I've looked at MongoDB's slow queries before and added compound indexes to speed them up. Now I see some indexes in MongoDB that are never used, so the next step is probably to remove all unused indexes. According to the monitoring, it seems that too few rules are being matched. Is it necessary to increase the number of pods for st2rulesengine? Do you have any good suggestions?
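To back the "unused indexes" observation with data, MongoDB reports per-index access counters via the `$indexStats` aggregation stage. A minimal sketch in pure Python, operating on documents you've already fetched with something like `db.some_collection.aggregate([{"$indexStats": {}}])` (the collection name and the sample index names below are illustrative):

```python
def find_unused_indexes(index_stats):
    """Given documents from a {"$indexStats": {}} aggregation, return the
    names of indexes never used since server start.

    Each document looks roughly like:
      {"name": "status_1", "accesses": {"ops": 0, "since": ...}}
    """
    unused = []
    for stat in index_stats:
        ops = stat.get("accesses", {}).get("ops", 0)
        # Never drop the mandatory _id index, even if it shows zero ops.
        if ops == 0 and stat.get("name") != "_id_":
            unused.append(stat["name"])
    return unused

# Example with captured stats (index names are made up):
stats = [
    {"name": "_id_", "accesses": {"ops": 0}},
    {"name": "status_1", "accesses": {"ops": 12345}},
    {"name": "start_timestamp_-1", "accesses": {"ops": 0}},
]
print(find_unused_indexes(stats))  # ['start_timestamp_-1']
```

Note that `$indexStats` counters reset on server restart, so check them after a representative period of uptime before dropping anything.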

Thanks!

arm4b commented 11 months ago

StackStorm performance indeed has many bottlenecks, and while we have an official K8s HA-focused deployment, it's important to keep in mind that the platform was designed and created before the mainstream K8s era. Since then, the st2 core hasn't been optimized or profiled for a setup with a large number of pods; development effort is missing in this area.

With that, latency matters for every st2 component, including the backends. K8s adds its own latency, and with a large number of pods you'll likely get a lot of network chatter, a higher error rate, and retries, and you may overdeploy things. My guess is that 20-30 pods per microservice could degrade the system rather than help. I'd try experimenting with the pod counts for each component.

The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). That StackStorm cluster connects to dedicated MongoDB, RabbitMQ, and Redis clusters (RAM-optimized, VM-based too), with proper DBA practices, buffering, caching, monitoring, and kernel tuning. Aka the old way :/

arm4b commented 11 months ago

Are your DBs/backends in K8s too? How is the setup looking there? If K8s is absolutely necessary, it may be worth trying a hybrid approach: keep the st2 pods in K8s and the DB/backends on dedicated VM-based clusters. Just as an experimental deployment, to see if that makes any difference, if your architecture requirements allow you to go outside of K8s.

chain312 commented 11 months ago

> Are your DBs/backends in K8s too? How is the setup looking there? If K8s is absolutely necessary, it may be worth trying a hybrid approach: keep the st2 pods in K8s and the DB/backends on dedicated VM-based clusters. Just as an experimental deployment, to see if that makes any difference, if your architecture requirements allow you to go outside of K8s.

I have deployed RabbitMQ, MongoDB, and Redis in Docker outside of K8s, but they are all single-node, not replicated. Should I deploy this middleware directly on physical servers next?

chain312 commented 11 months ago

> StackStorm performance indeed has many bottlenecks, and while we have an official K8s HA-focused deployment, it's important to keep in mind that the platform was designed and created before the mainstream K8s era. Since then, the st2 core hasn't been optimized or profiled for a setup with a large number of pods; development effort is missing in this area.
>
> With that, latency matters for every st2 component, including the backends. K8s adds its own latency, and with a large number of pods you'll likely get a lot of network chatter, a higher error rate, and retries, and you may overdeploy things. My guess is that 20-30 pods per microservice could degrade the system rather than help. I'd try experimenting with the pod counts for each component.
>
> The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). That StackStorm cluster connects to dedicated MongoDB, RabbitMQ, and Redis clusters (RAM-optimized, VM-based too), with proper DBA practices, buffering, caching, monitoring, and kernel tuning. Aka the old way :/

Do you have any performance numbers for the setup you described, such as action throughput, workflow throughput, or other rates?

guzzijones commented 11 months ago

The workflow engine has a tooz lock that causes diminishing returns once you run more than a few engines.
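The effect is easy to see in isolation: if every engine must take the same distributed lock around the critical section, adding engines adds contention, not throughput. A toy simulation with a plain `threading.Lock` standing in for the tooz lock (the real lock is backed by a coordination backend such as Redis or ZooKeeper, but the serialization effect is the same):

```python
import threading

lock = threading.Lock()  # stand-in for the tooz coordination lock
active = 0               # "engines" currently inside the critical section
max_active = 0           # peak concurrency observed

def handle_workflow_event():
    global active, max_active
    with lock:           # every engine serializes here
        active += 1
        max_active = max(max_active, active)
        # ... process the workflow state transition ...
        active -= 1

# Simulate 30 workflow engines all racing to handle events.
threads = [threading.Thread(target=handle_workflow_event) for _ in range(30)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No matter how many "engines" we add, only one is ever doing work at a time.
print(max_active)  # 1
```

Beyond a handful of engines, extra replicas mostly wait on the lock, which matches the diminishing returns described above.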

chain312 commented 10 months ago

> The workflow engine has a tooz lock that causes diminishing returns once you run more than a few engines.

Is there any plan in the project to improve the tooz lock?

chain312 commented 4 months ago

> The workflow engine has a tooz lock that causes diminishing returns once you run more than a few engines.

@guzzijones @arm4b I took a look at the code, and it seems the st2scheduler service fetches data from the st2.action_execution_scheduling_queue_item_db collection. Why isn't this data stored in the message queue (MQ)?
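For context, the pattern being asked about looks roughly like this: the scheduler polls a MongoDB-backed queue collection rather than consuming directly from RabbitMQ, which lets it re-sort items by priority and recover items whose claim went stale, at the cost of polling latency. A simplified pure-Python sketch of such a DB-backed queue (the field names here are illustrative, not the actual st2 schema):

```python
import time

class SchedulingQueue:
    """Toy model of a DB-backed scheduling queue: items are claimed by
    setting a 'handling' flag, so stale claims can be detected and
    re-queued -- something a fire-and-forget MQ consumer cannot easily do."""

    def __init__(self):
        self.items = []  # stands in for the MongoDB collection

    def add(self, execution_id, priority=0):
        self.items.append({"execution_id": execution_id,
                           "priority": priority,
                           "handling": False,
                           "claimed_at": None})

    def claim_next(self):
        # Pick the highest-priority unclaimed item, like a sorted find-one
        # query; return None when nothing is pending.
        pending = [i for i in self.items if not i["handling"]]
        if not pending:
            return None
        item = max(pending, key=lambda i: i["priority"])
        item["handling"] = True
        item["claimed_at"] = time.time()
        return item

q = SchedulingQueue()
q.add("exec-1", priority=1)
q.add("exec-2", priority=5)
print(q.claim_next()["execution_id"])  # exec-2
```

The trade-off is that every scheduler replica has to poll and contend on the same collection, which is one more reason raw pod count does not translate directly into throughput.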