Thanks for the report.
Are you running StackStorm in K8s with 1 replica of each service? When you say RabbitMQ cluster, is it really a multi-server cluster or a single RabbitMQ instance? Please provide your RabbitMQ version, the clustering strategy used, how many nodes are in the cluster, and whether there is any non-standard configuration for RabbitMQ.
That'll help us understand it.
@armab Yes, I'm running StackStorm in K8s with 1 replica of each service.
It's really a RabbitMQ cluster:
kubectl get pods -n openstack -o wide | grep rabbit
prometheus-rabbitmq-exporter-5c98d7b8bd-64sdp 1/1 Running 0 1h 10.233.66.65 node-2
rabbitmq-0 1/1 Running 0 1h 10.233.64.72 node-3
rabbitmq-1 1/1 Running 0 1h 10.233.66.76 node-2
rabbitmq-2 1/1 Running 0 1h 10.233.65.66 node-1
The RabbitMQ version is 3.7.0 and there are 3 nodes in the MQ cluster:
rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
[{disc,
['rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local']}]},
{running_nodes,
['rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local']},
{cluster_name,
<<"rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local">>},
{partitions,[]},
{alarms,
[{'rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local',[]},
{'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',[]},
{'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local',
[]}]}]
Clustering strategy:
rabbitmqctl list_policies
Listing policies for vhost "/" ...
/ ha-all ^(?!amq\.).* all {"ha-mode":"all"} 0
@armab Sorry, I'm actually running StackStorm in K8s with a different number of replicas for each service.
kubectl get pods -n openstack -o wide
mistral-api-777d6bc7bb-h2zsv 1/1 Running 0 2h 10.233.65.3 node-1
mistral-api-777d6bc7bb-ldc62 1/1 Running 0 2h 10.233.64.4 node-3
mistral-api-777d6bc7bb-r6cf9 1/1 Running 1 2h 10.233.66.4 node-2
mistral-server-564db59bbd-59h2l 1/1 Running 0 2h 10.233.64.5 node-3
mistral-server-564db59bbd-w5qzf 1/1 Running 0 2h 10.233.65.4 node-1
mistral-server-564db59bbd-xvxdq 1/1 Running 0 2h 10.233.66.5 node-2
st2actionrunner-0 1/1 Running 1 2h 10.233.64.75 node-3
st2api-7f548d8d47-pm97r 1/1 Running 2 2h 10.233.66.6 node-2
st2api-7f548d8d47-s7pqj 1/1 Running 0 2h 10.233.65.7 node-1
st2api-7f548d8d47-wxvj7 1/1 Running 1 2h 10.233.64.10 node-3
st2auth-7dbff44b54-7c97h 1/1 Running 0 2h 10.233.64.8 node-3
st2auth-7dbff44b54-bhgg2 1/1 Running 2 2h 10.233.66.7 node-2
st2auth-7dbff44b54-bwrtc 1/1 Running 0 2h 10.233.65.6 node-1
st2garbagecollector-5646dbc79d-zzkcs 1/1 Running 0 2h 10.233.65.5 node-1
st2notifier-78855cc94b-6ntsb 1/1 Running 0 2h 10.233.65.8 node-1
st2notifier-78855cc94b-nz2lj 1/1 Running 1 2h 10.233.64.6 node-3
st2notifier-78855cc94b-tzqd5 1/1 Running 1 2h 10.233.66.9 node-2
st2resultstracker-f744fb746-96h2t 1/1 Running 1 2h 10.233.64.7 node-3
st2resultstracker-f744fb746-wd28s 1/1 Running 2 2h 10.233.66.8 node-2
st2resultstracker-f744fb746-zcbt9 1/1 Running 1 2h 10.233.65.9 node-1
st2rulesengine-55d897f74-ldngt 1/1 Running 0 2h 10.233.64.9 node-3
st2sensorcontainer-56f5fbcf8b-sszhb 1/1 Running 2 2h 10.233.66.10 node-2
st2stream-6f9988b848-n45tj 1/1 Running 1 2h 10.233.65.10 node-1
st2stream-6f9988b848-sdvws 1/1 Running 1 2h 10.233.64.11 node-3
st2stream-6f9988b848-tnr2p 1/1 Running 2 2h 10.233.66.11 node-2
Here are my Helm chart values:
pod:
replicas:
st2web: 3
st2stream: 3
st2sensorcontainer: 1
st2rulesengine: 1
st2resultstracker: 3
st2notifier: 3
st2garbagecollector: 1
st2auth: 3
st2api: 3
st2actionrunner: 1
mistralserver: 3
mistralapi: 3
BTW I have a question about the replicas: https://forum.stackstorm.com/t/do-we-support-multiple-replicas-for-st2sensorcontainer-st2rulesengine-in-kubernetes-1ppc-mode/116
rabbitmqctl list_queues exclusive durable auto_delete state name | grep st2 | sort -n
false true false running st2.actionrunner.cancel
false true false running st2.actionrunner.pause
false true false running st2.actionrunner.req
false true false running st2.actionrunner.resume
false true false running st2.actionrunner.work
false true false running st2.notifiers.execution.work
false true false running st2.preinit
false true false running st2.resultstracker.work
false true false running st2.sensor.watch.sensor_container-3b319447d2
false true false running st2.trigger_instances_dispatch.rules_engine
true false true running st2.trigger.watch.St2Timer-3aaf30653b
true false true running st2.trigger.watch.TimersController-2674dd8c90
true false true running st2.trigger.watch.TimersController-75acf4c433
true false true running st2.trigger.watch.TimersController-eed61cf1a9
true false true running st2.trigger.watch.WebhooksController-4615ca5b33
true false true running st2.trigger.watch.WebhooksController-4bbd99a887
true false true running st2.trigger.watch.WebhooksController-fb05163785
true false true running st2.trigger.watch.sensorwrapper_linux_FileWatchSensor-e7603a1c99
@armab Is there any special reason for using exclusive queues? It seems that only the st2.trigger.watch.* queues are exclusive.
The reason exclusive queues are used: https://github.com/StackStorm/st2/pull/2081/commits/c94ccde663277fae97f5aec6dc9c5e14108c51a2
This issue was resolved by https://github.com/rabbitmq/rabbitmq-server/issues/1323
@longfei-zhang That's good investigation work and a great find! :+1:
Sounds like the distros have an older RabbitMQ version available by default.
Can you confirm the fixed behavior with the newer RabbitMQ version 3.6.11+?
Would you recommend sticking with the official RabbitMQ apt/deb repositories, which have the updated/fixed versions?
Sorry, this issue is not fixed with the newer RabbitMQ version 3.6.11+.
Now we use a new policy and that fixed the issue: not doing queue mirroring for the st2.trigger.watch queues resolves it. `rabbitmqctl set_policy ha-two '^(?!st2.trigger.watch.).' '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'`
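For reference, here is a minimal sketch of applying the same policy through the RabbitMQ management HTTP API instead of rabbitmqctl; the in-cluster hostname, management port 15672 and guest/guest credentials are assumptions for illustration.

import requests

# Hypothetical management endpoint and credentials -- adjust for your cluster.
MGMT_URL = "http://rabbitmq.openstack.svc.cluster.local:15672"
AUTH = ("guest", "guest")

policy = {
    "pattern": "^(?!st2.trigger.watch.).",  # exclude st2.trigger.watch.* queues from mirroring
    "definition": {
        "ha-mode": "exactly",
        "ha-params": 2,
        "ha-sync-mode": "automatic",
    },
    "apply-to": "all",  # same default scope as the rabbitmqctl command
}

# PUT /api/policies/{vhost}/{name}; %2F is the URL-encoded default vhost "/"
resp = requests.put(MGMT_URL + "/api/policies/%2F/ha-two", auth=AUTH, json=policy)
resp.raise_for_status()  # succeeds with 201 Created or 204 No Content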
Resolved by setting the policy as above.
ISSUE TYPE
STACKSTORM VERSION
st2 2.6.0
OS / ENVIRONMENT / INSTALL METHOD
Kubernetes 1ppc mode
SUMMARY
STEPS TO REPRODUCE
We're having on-and-off connectivity issues between the RabbitMQ cluster and StackStorm, and occasionally we're seeing a ResourceLocked error bubbling up from inside kombu.
This happens when StackStorm tries to bind a new consumer to an exclusive queue. StackStorm uses exclusive, auto-delete queues. Exclusive means the queue can only be used by the connection that declared it, so it has exactly one consumer. Auto-delete means the queue is removed when its last consumer disconnects. The queue should therefore be removed when the consumer disconnects.
RabbitMQ takes some time to reap the old queues, though. It looks like the PollingQueueConsumer is reconnecting and re-declaring its queue before the old one has had a chance to be removed.
There may be a bug here. Reconnecting to an exclusive queue seems to be asking for trouble.
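To make the failure mode concrete, here is a minimal kombu sketch (not st2's actual code) of the race: an exclusive, auto-delete queue is re-declared from a new connection while the broker still considers the old connection its owner. The queue/exchange names and broker URL are made up, and ResourceLocked comes from the py-amqp transport kombu uses for amqp:// URLs.

from amqp.exceptions import ResourceLocked
from kombu import Connection, Exchange, Queue

exchange = Exchange("st2.trigger", type="topic")
watch_queue = Queue(
    "st2.trigger.watch.Example-0123456789",  # hypothetical watch queue name
    exchange=exchange,
    routing_key="#",
    exclusive=True,    # usable only by the connection that declared it
    auto_delete=True,  # removed once its last consumer disconnects
)

broker_url = "amqp://guest:guest@rabbitmq:5672//"

# First connection declares and owns the exclusive queue.
conn_a = Connection(broker_url)
watch_queue(conn_a.channel()).declare()

# A reconnecting consumer on a fresh connection re-declares the same queue
# before the broker has reaped the old one and gets RESOURCE_LOCKED (405).
conn_b = Connection(broker_url)
try:
    watch_queue(conn_b.channel()).declare()
except ResourceLocked as exc:
    print("broker refused the re-declare:", exc)
finally:
    conn_b.release()
    conn_a.release()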
Exclusive queues that stay around after recovering from a network partition will hit this error.
Clients have re-connected, generating new exclusive queues. The old ones stay around for some reason (they were not declared as durable). And when I try removing one from the management UI, I get the exact same error:
405 RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.St2Timer-883eb620b9' in vhost '/'
Is it possible to drop the exclusivity of the queue?