StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Workflows stuck in delayed queue since hours/days and never gets executed #6230

Open zsmanjot opened 2 months ago

zsmanjot commented 2 months ago

Hi there. Could anyone help me with an issue? We use ST2 extensively, and many workflows run on the box. The box is well provisioned in terms of memory and CPU.

We have noticed that workflows get queued and never get executed, and the delayed queue is always very long whenever we check.

For example:

If I check the delayed queue by running the following command (st2 execution list -l --status delayed), I can see workflows from 2 days ago that never made it to execution. Because of this, other workflows are also affected: a simple workflow that normally takes 10 minutes takes 50 minutes to finish.
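To get a quick view of how old the delayed entries are, a small script can help. The following is only a sketch: it assumes the delayed executions have been exported as a list of JSON objects and that each object carries an id and an ISO 8601 start_timestamp. Both field names and the sample records are assumptions for illustration, not taken from this setup.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def stale_executions(executions, now, max_age):
    """Return (id, age) pairs for executions older than max_age, oldest first."""
    stale = []
    for ex in executions:
        # 'start_timestamp' is an assumed field name; adjust to your export.
        age = now - parse_ts(ex["start_timestamp"])
        if age > max_age:
            stale.append((ex["id"], age))
    return sorted(stale, key=lambda pair: pair[1], reverse=True)


# Illustrative records only; the IDs and times are made up.
sample = [
    {"id": "exec-a", "start_timestamp": "2024-08-01T10:00:00.000000Z"},
    {"id": "exec-b", "start_timestamp": "2024-08-03T09:30:00.000000Z"},
    {"id": "exec-c", "start_timestamp": "2024-07-31T22:15:00.000000Z"},
]
now = datetime(2024, 8, 3, 10, 0, tzinfo=timezone.utc)

for exec_id, age in stale_executions(sample, now, timedelta(days=1)):
    print(exec_id, "delayed for", age)
```

Sorting oldest first makes it easy to see whether the backlog is a handful of very old entries or a steadily growing queue.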

Can anybody help me here?

Example:

image
zsmanjot commented 2 months ago

The problem is getting worse, as there is a lot of delay. I checked and found that mongodb is running at high CPU here.

Any ideas what can be done here? I know there are too many triggers to handle these days, so could that be the reason? If yes, how can we address this?

image
zsmanjot commented 2 months ago

Also, I can see in the DB that I have 4767 workflows in the delayed state.

image
zsmanjot commented 2 months ago

@arm4b Any solution here?

chain312 commented 2 months ago

Can you show what state your actions are in? I asked the same question today on Slack. I have been struggling with this for a long time and am still trying to solve it.

zsmanjot commented 2 months ago

My older workflows never make it to execution. If one somehow does, it just keeps hanging at one task or another for hours and completes only after 2 or 3 hours.

zsmanjot commented 2 months ago

Also, I am getting this error as well.

root@stackstorm:~# st2 execution list -l --status delayed -n 2000 2>/dev/null
ERROR: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

guzzijones commented 2 months ago

This is a known issue. If the rabbitmq connection retries are exhausted, an action is stuck running forever. Likely your box is experiencing some internal network issues. Do your workflows create a very large context or have very large inputs or outputs?

zsmanjot commented 2 months ago

Thanks @guzzijones for replying. There are no underlying network issues. Regarding large inputs and outputs: no, they are not very large. But what we have noticed is that the number of triggers it receives nowadays is huge.

But the main concern is that ST2 keeps them in queues for days and never even executes them.

chain312 commented 2 months ago

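Thanks for your reply. There are no underlying network issues. Regarding large inputs and outputs: no, they are not very large. But what has been noticed is that the number of triggers it now receives every day is huge.

But the main problem is that ST2 keeps them queued for days and never even executes them.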

Can you see what state most workflow instances are in?

zsmanjot commented 2 months ago

They are all stuck in the delayed state. More than 5000 workflows.

guzzijones commented 2 months ago

What is in your st2-workflow-engine logs and st2-action-runner logs? I bet you see disconnects from rabbit-mq.

zsmanjot commented 2 months ago

@guzzijones No, I could not see any rabbit-mq disconnects. Even if I try to purge older workflows, it does not do anything, and I have to grep the IDs and cancel these older workflows manually.
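The manual grep-and-cancel step can at least be scripted. A minimal sketch, assuming a plain list of execution IDs is already at hand: it builds one st2 execution cancel invocation per ID and only prints the commands unless dry_run is turned off. The IDs shown are placeholders, and the helper names are made up for this example.

```python
import shlex
import subprocess


def cancel_commands(execution_ids):
    """Build one 'st2 execution cancel <id>' argv list per execution ID."""
    return [["st2", "execution", "cancel", exec_id] for exec_id in execution_ids]


def cancel_all(execution_ids, dry_run=True):
    """Print each cancel command; run it only when dry_run is False."""
    commands = cancel_commands(execution_ids)
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=False so one failed cancel does not stop the rest.
            subprocess.run(argv, check=False)
    return commands


# Placeholder IDs; replace with the IDs collected from the delayed list.
cancel_all(["id-one", "id-two"], dry_run=True)
```

Keeping dry_run on by default means the list of commands can be reviewed before anything is actually cancelled.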

This is a big performance issue.

zsmanjot commented 2 months ago

This is one example:

image

See the requested and scheduled times: a 3-hour delay. How can we reduce this delay? What factors might we be missing here? Any ideas?
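To put a number on that gap across many executions rather than reading it off a screenshot, the wait between the two states can be computed. The sketch below assumes each execution exposes a list of status transitions, each with a status and an ISO 8601 timestamp; the field names and the sample values are assumptions for illustration.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def queue_wait(status_log, start="requested", end="scheduled"):
    """Return the time between the first `start` and first `end` entries.

    Returns None if either status is missing from the log.
    """
    first_seen = {}
    for entry in status_log:
        # Keep only the first timestamp seen for each status.
        first_seen.setdefault(entry["status"], parse_ts(entry["timestamp"]))
    if start not in first_seen or end not in first_seen:
        return None
    return first_seen[end] - first_seen[start]


# Made-up transitions mirroring the 3-hour gap described above.
log = [
    {"status": "requested", "timestamp": "2024-08-03T06:00:00Z"},
    {"status": "delayed", "timestamp": "2024-08-03T06:00:01Z"},
    {"status": "scheduled", "timestamp": "2024-08-03T09:00:00Z"},
]
print(queue_wait(log))
```

Collecting this value for every execution over a day would show whether the wait grows with trigger volume or stays flat.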

chain312 commented 2 months ago

It is probably blocked. If the performance problem cannot be solved, you can add filter criteria to the rule so that only the data that needs to be processed automatically gets through.