chain312 opened 11 months ago
The StackStorm performance indeed has lots of bottlenecks, and while we have an official K8s HA-focused deployment, it's important to keep in mind that the platform was designed and created before the K8s mainstream era. Since then, the st2 core hasn't been optimized or profiled for such a setup with a large number of pods; development effort is missing in this area.
With that, the latency of every st2 component, including the backends, matters. K8s adds its own latency, and with a large number of Pods you'll likely get a lot of network chatter, a higher error rate, retries, and over-deployed services. My guess is that 20-30 pods per microservice could degrade the system rather than help. I'd try experimenting with the Pod count for each component there.
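If you want to experiment with that, the official stackstorm-ha Helm chart exposes per-service replica counts as chart values. A minimal values-override sketch, assuming the key layout of recent chart versions (the exact key names may differ, so check the chart's own values.yaml):

```yaml
# values-override.yaml for the stackstorm-ha chart (illustrative key names)
st2actionrunner:
  replicas: 5       # scale runners with actual execution load
st2workflowengine:
  replicas: 2       # diminishing returns beyond a few replicas
st2scheduler:
  replicas: 2
st2rulesengine:
  replicas: 2
```

Then roll it out with `helm upgrade <release> stackstorm/stackstorm-ha -f values-override.yaml` and watch the error/retry rates before scaling any further.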
The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). That StackStorm cluster connects to dedicated MongoDB, RabbitMQ and Redis clusters (RAM-optimized, VM-based too), with proper DBA practices, buffering, caching, monitoring and kernel tuning. Aka the old way :/
Are your DBs/backends in K8s too? How does the setup look there? If K8s is absolutely necessary, it may be worth trying the hybrid way: keep the st2 pods in K8s and the DB/backends on dedicated VM-based clusters. Just as an experimental deployment to see if that makes any difference, assuming your architecture requirements allow going outside of K8s.
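Pointing st2 at external backends is purely a configuration change. A minimal st2.conf sketch with placeholder hostnames (verify the exact option names against the st2 configuration docs for your version):

```ini
# /etc/st2/st2.conf -- hostnames below are placeholders
[database]
# MongoDB connection; a replica-set URI should work here
host = mongodb://mongo1.example.com,mongo2.example.com/st2?replicaSet=rs0

[messaging]
# RabbitMQ cluster URL
url = amqp://st2:CHANGE-ME@rabbitmq.example.com:5672

[coordination]
# Backend used by tooz for distributed locks and coordination
url = redis://redis.example.com:6379
```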
I have deployed RabbitMQ, MongoDB and Redis in Docker outside of K8s, but they are all single-node, not replicated. Should I deploy these middleware services directly on physical servers next?
> The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). [...]
Do you have any performance figures for the setup you described, such as the action execution rate, the workflow throughput, or other rates?
The workflow engine takes a tooz lock, so running more than a few engines yields diminishing speed improvements.
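For illustration only (this is not the actual st2 code, and the Redis URL and lock name are placeholders), contention on a tooz distributed lock looks roughly like this:

```python
# Sketch of tooz lock contention -- illustrative, not StackStorm's code.
# pip install tooz[redis]
from tooz import coordination

# Each workflow engine replica creates a coordinator with a unique member id.
coordinator = coordination.get_coordinator(
    "redis://localhost:6379", b"workflow-engine-1"
)
coordinator.start()

# Every replica requests the same named lock, so only one holds it at a time:
# adding more engines mostly adds waiters rather than throughput.
lock = coordinator.get_lock(b"workflow-critical-section")
with lock:
    # ... the serialized piece of workflow-state bookkeeping ...
    pass

coordinator.stop()
```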
Is there any plan in the project to improve the tooz lock?
> The workflow engine takes a tooz lock, so running more than a few engines yields diminishing speed improvements.

@guzzijones @arm4b I took a look at the code, and it seems the st2scheduler service fetches its work items from the action_execution_scheduling_queue_item_db collection in MongoDB. Why isn't this data kept in the message queue (MQ) instead?
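For context, the pattern I'm describing looks roughly like this; an illustrative sketch with placeholder field names, not the actual st2 schema or code:

```python
# Sketch of a DB-backed scheduling queue (placeholder field names;
# not StackStorm's actual schema or code).
# pip install pymongo
import datetime
from pymongo import MongoClient, ASCENDING, ReturnDocument

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
queue = client["st2"]["action_execution_scheduling_queue_item_db"]

def claim_next_item(scheduler_id: str):
    """Atomically claim the oldest unhandled item that is due to run.

    find_one_and_update is atomic, so two scheduler replicas can never
    claim the same item -- the database doubles as the coordination point.
    """
    now = datetime.datetime.utcnow()
    return queue.find_one_and_update(
        {"handling": False, "scheduled_start_timestamp": {"$lte": now}},
        {"$set": {"handling": True, "claimed_by": scheduler_id}},
        sort=[("scheduled_start_timestamp", ASCENDING)],
        return_document=ReturnDocument.AFTER,
    )
```

I can see that an atomic claim like this supports delayed starts and recovery across scheduler replicas, but I'd still expect the MQ to handle the dispatch itself.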
SUMMARY
I used K8s to deploy st2, and now the entire workflow/action execution is running slowly. I tried increasing the number of replicas to speed up execution, but it doesn't seem to help.
STACKSTORM VERSION
st2 3.8.0, on Python 3.8.10
OS, environment, install method
Kubernetes
Steps to reproduce the problem
The number of pods for each of my microservices in K8s is as follows:
Expected Results
Actual Results
I've looked at Mongo's slow queries before and added compound indexes to speed up those queries. Now I see some indexes in Mongo that are never used, so the next step is probably to remove all unused indexes. According to the monitoring, it seems only a small number of rules can be matched per unit of time. Is it necessary to increase the number of Pods for st2rulesengine? Do you have any good suggestions?
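The index audit I'm planning looks roughly like this; a minimal sketch using MongoDB's $indexStats aggregation stage (placeholder connection string; the counters reset on mongod restart, so I'll let the server run a full workload cycle before dropping anything):

```python
# Sketch: list per-index usage counters and flag drop candidates.
# pip install pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["st2"]

for coll_name in db.list_collection_names():
    for stats in db[coll_name].aggregate([{"$indexStats": {}}]):
        ops = stats["accesses"]["ops"]
        if ops == 0 and stats["name"] != "_id_":
            # Candidate only -- confirm it isn't needed for rare queries
            # before calling db[coll_name].drop_index(stats["name"]).
            print(f"{coll_name}.{stats['name']}: never used since restart")
```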
Thanks!