Thanks for the report.
Are you running StackStorm in K8s with 1 replica of each service? When you say RabbitMQ cluster, is it really a multi-server cluster or a single RabbitMQ instance? Please provide your RabbitMQ version, the clustering strategy used, how many nodes are in the cluster, and whether there is any non-standard configuration for RabbitMQ.
That'll help us understand it.
@armab Yes, I'm running StackStorm in K8s with 1 replica of each service.
It's really a RabbitMQ cluster:
kubectl get pods -n openstack -o wide | grep rabbit
prometheus-rabbitmq-exporter-5c98d7b8bd-64sdp 1/1 Running 0 1h 10.233.66.65 node-2
rabbitmq-0 1/1 Running 0 1h 10.233.64.72 node-3
rabbitmq-1 1/1 Running 0 1h 10.233.66.76 node-2
rabbitmq-2 1/1 Running 0 1h 10.233.65.66 node-1
The RabbitMQ version is 3.7.0 and there are 3 nodes in the MQ cluster:
rabbitmqctl cluster_status
Cluster status of node rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local ...
[{nodes,
[{disc,
['rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local']}]},
{running_nodes,
['rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',
'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local']},
{cluster_name,
<<"rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local">>},
{partitions,[]},
{alarms,
[{'rabbit@rabbitmq-2.rabbitmq-discovery.openstack.svc.cluster.local',[]},
{'rabbit@rabbitmq-1.rabbitmq-discovery.openstack.svc.cluster.local',[]},
{'rabbit@rabbitmq-0.rabbitmq-discovery.openstack.svc.cluster.local',
[]}]}]
Clustering strategy:
rabbitmqctl list_policies
Listing policies for vhost "/" ...
/ ha-all ^(?!amq\.).* all {"ha-mode":"all"} 0
@armab Sorry, I'm actually running StackStorm in K8s with a different number of replicas for each service.
kubectl get pods -n openstack -o wide
mistral-api-777d6bc7bb-h2zsv 1/1 Running 0 2h 10.233.65.3 node-1
mistral-api-777d6bc7bb-ldc62 1/1 Running 0 2h 10.233.64.4 node-3
mistral-api-777d6bc7bb-r6cf9 1/1 Running 1 2h 10.233.66.4 node-2
mistral-server-564db59bbd-59h2l 1/1 Running 0 2h 10.233.64.5 node-3
mistral-server-564db59bbd-w5qzf 1/1 Running 0 2h 10.233.65.4 node-1
mistral-server-564db59bbd-xvxdq 1/1 Running 0 2h 10.233.66.5 node-2
st2actionrunner-0 1/1 Running 1 2h 10.233.64.75 node-3
st2api-7f548d8d47-pm97r 1/1 Running 2 2h 10.233.66.6 node-2
st2api-7f548d8d47-s7pqj 1/1 Running 0 2h 10.233.65.7 node-1
st2api-7f548d8d47-wxvj7 1/1 Running 1 2h 10.233.64.10 node-3
st2auth-7dbff44b54-7c97h 1/1 Running 0 2h 10.233.64.8 node-3
st2auth-7dbff44b54-bhgg2 1/1 Running 2 2h 10.233.66.7 node-2
st2auth-7dbff44b54-bwrtc 1/1 Running 0 2h 10.233.65.6 node-1
st2garbagecollector-5646dbc79d-zzkcs 1/1 Running 0 2h 10.233.65.5 node-1
st2notifier-78855cc94b-6ntsb 1/1 Running 0 2h 10.233.65.8 node-1
st2notifier-78855cc94b-nz2lj 1/1 Running 1 2h 10.233.64.6 node-3
st2notifier-78855cc94b-tzqd5 1/1 Running 1 2h 10.233.66.9 node-2
st2resultstracker-f744fb746-96h2t 1/1 Running 1 2h 10.233.64.7 node-3
st2resultstracker-f744fb746-wd28s 1/1 Running 2 2h 10.233.66.8 node-2
st2resultstracker-f744fb746-zcbt9 1/1 Running 1 2h 10.233.65.9 node-1
st2rulesengine-55d897f74-ldngt 1/1 Running 0 2h 10.233.64.9 node-3
st2sensorcontainer-56f5fbcf8b-sszhb 1/1 Running 2 2h 10.233.66.10 node-2
st2stream-6f9988b848-n45tj 1/1 Running 1 2h 10.233.65.10 node-1
st2stream-6f9988b848-sdvws 1/1 Running 1 2h 10.233.64.11 node-3
st2stream-6f9988b848-tnr2p 1/1 Running 2 2h 10.233.66.11 node-2
Here are my Helm chart values:
pod:
replicas:
st2web: 3
st2stream: 3
st2sensorcontainer: 1
st2rulesengine: 1
st2resultstracker: 3
st2notifier: 3
st2garbagecollector: 1
st2auth: 3
st2api: 3
st2actionrunner: 1
mistralserver: 3
mistralapi: 3
BTW I have a question about the replicas: https://forum.stackstorm.com/t/do-we-support-multiple-replicas-for-st2sensorcontainer-st2rulesengine-in-kubernetes-1ppc-mode/116
rabbitmqctl list_queues exclusive durable auto_delete state name | grep st2 | sort -n
false true false running st2.actionrunner.cancel
false true false running st2.actionrunner.pause
false true false running st2.actionrunner.req
false true false running st2.actionrunner.resume
false true false running st2.actionrunner.work
false true false running st2.notifiers.execution.work
false true false running st2.preinit
false true false running st2.resultstracker.work
false true false running st2.sensor.watch.sensor_container-3b319447d2
false true false running st2.trigger_instances_dispatch.rules_engine
true false true running st2.trigger.watch.St2Timer-3aaf30653b
true false true running st2.trigger.watch.TimersController-2674dd8c90
true false true running st2.trigger.watch.TimersController-75acf4c433
true false true running st2.trigger.watch.TimersController-eed61cf1a9
true false true running st2.trigger.watch.WebhooksController-4615ca5b33
true false true running st2.trigger.watch.WebhooksController-4bbd99a887
true false true running st2.trigger.watch.WebhooksController-fb05163785
true false true running st2.trigger.watch.sensorwrapper_linux_FileWatchSensor-e7603a1c99
@armab Is there any special reason for using exclusive queues? It seems that only the st2.trigger.watch.* queues are exclusive.
The reason exclusive queues are used: https://github.com/StackStorm/st2/pull/2081/commits/c94ccde663277fae97f5aec6dc9c5e14108c51a2
This issue was resolved by https://github.com/rabbitmq/rabbitmq-server/issues/1323
@longfei-zhang That's good investigation work and a great find! :+1:
Sounds like the distros have an older RabbitMQ version available by default.
Can you confirm the fixed behavior with the newer RabbitMQ version 3.6.11+?
Would you recommend sticking with the official RabbitMQ apt/deb repositories, which have the updated/fixed versions?
Sorry, this issue is not fixed with the newer RabbitMQ version 3.6.11+.
Now we use a new policy and that fixed the issue: not doing queue mirroring for the st2.trigger.watch queues resolves it. `rabbitmqctl set_policy ha-two '^(?!st2.trigger.watch.).' '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic"}'`
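For reference, here is a minimal sketch of applying the same policy through the RabbitMQ management HTTP API instead of rabbitmqctl; the in-cluster hostname, management port 15672 and guest/guest credentials are assumptions for illustration.

import requests

# Hypothetical management endpoint and credentials -- adjust for your cluster.
MGMT_URL = "http://rabbitmq.openstack.svc.cluster.local:15672"
AUTH = ("guest", "guest")

policy = {
    "pattern": "^(?!st2.trigger.watch.).",  # exclude st2.trigger.watch.* queues from mirroring
    "definition": {
        "ha-mode": "exactly",
        "ha-params": 2,
        "ha-sync-mode": "automatic",
    },
    "apply-to": "all",  # same default scope as the rabbitmqctl command
}

# PUT /api/policies/{vhost}/{name}; %2F is the URL-encoded default vhost "/"
resp = requests.put(MGMT_URL + "/api/policies/%2F/ha-two", auth=AUTH, json=policy)
resp.raise_for_status()  # succeeds with 201 Created or 204 No Content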
Resolved by setting the policy as above.
ISSUE TYPE
STACKSTORM VERSION
st2 2.6.0
OS / ENVIRONMENT / INSTALL METHOD
Kubernetes 1ppc mode
SUMMARY
STEPS TO REPRODUCE
We're having on-and-off connectivity issues between the RabbitMQ cluster and StackStorm, and occasionally we're seeing a ResourceLocked error bubbling up from inside kombu.
This happens when StackStorm tries to bind a new consumer to an exclusive queue. StackStorm uses exclusive, auto-delete queues. Exclusive means the queue can only be used by the connection that declared it, so it has exactly one consumer. Auto-delete means the queue is removed when its last consumer disconnects. The queue should therefore be removed when the consumer disconnects.
RabbitMQ takes some time to reap the old queues, though. It looks like the PollingQueueConsumer is reconnecting and re-declaring its queue before the old one has had a chance to be removed.
There may be a bug here. Reconnecting to an exclusive queue seems to be asking for trouble.
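To make the failure mode concrete, here is a minimal kombu sketch (not st2's actual code) of the race: an exclusive, auto-delete queue is re-declared from a new connection while the broker still considers the old connection its owner. The queue/exchange names and broker URL are made up, and ResourceLocked comes from the py-amqp transport kombu uses for amqp:// URLs.

from amqp.exceptions import ResourceLocked
from kombu import Connection, Exchange, Queue

exchange = Exchange("st2.trigger", type="topic")
watch_queue = Queue(
    "st2.trigger.watch.Example-0123456789",  # hypothetical watch queue name
    exchange=exchange,
    routing_key="#",
    exclusive=True,    # usable only by the connection that declared it
    auto_delete=True,  # removed once its last consumer disconnects
)

broker_url = "amqp://guest:guest@rabbitmq:5672//"

# First connection declares and owns the exclusive queue.
conn_a = Connection(broker_url)
watch_queue(conn_a.channel()).declare()

# A reconnecting consumer on a fresh connection re-declares the same queue
# before the broker has reaped the old one and gets RESOURCE_LOCKED (405).
conn_b = Connection(broker_url)
try:
    watch_queue(conn_b.channel()).declare()
except ResourceLocked as exc:
    print("broker refused the re-declare:", exc)
finally:
    conn_b.release()
    conn_a.release()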
Exclusive queues that stay around after recovering from a network partition will hit this error.
Clients have re-connected, generating new exclusive queues. The old ones stay around for some reason (they were not declared as durable). And when I try removing one from the management UI, I get the exact same error:
405 RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.St2Timer-883eb620b9' in vhost '/'
Is it possible to drop the exclusivity of the queue?