StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.05k stars 745 forks source link

Handle orphaned locks after network issues gracefully #4958

Open winem opened 4 years ago

winem commented 4 years ago

SUMMARY

St2 services do not fully recover from network partitions or other network issues where packets sent to RabbitMQ were dropped.

STACKSTORM VERSION

st2 3.3dev (9ea417346), on Python 3.6.9

Earlier versions are affected as well.

OS, environment, install method

I came across this issue when doing some tests for https://github.com/StackStorm/st2-dockerfiles/pull/28 and was able to reproduce it on plain Ubuntu VMs as well.

Steps to reproduce the problem

  1. Start all services
  2. Drop the packages being sent to the RabbitMQ instance In my case I did this by prepending an iptables rule to the FORWARD chain: sudo iptables -I FORWARD 1 -d 172.23.0.3 -j DROP
  3. Wait until the first exceptions occur and the connections attempts from the st2 services fail with a timeout. The log:
    [--- st2api ---]2020-05-23 20:17:36,162 WARNING [-] Broker connection error, trying again in 2.0 seconds: timeout('timed out',).
    [--- st2api ---]Traceback (most recent call last):
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 137, in _connect
    [--- st2api ---]    host, port, family, socket.SOCK_STREAM, SOL_TCP)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/support/greendns.py", line 537, in getaddrinfo
    [--- st2api ---]    raise socket.gaierror(socket.EAI_NONAME, 'No address found')
    [--- st2api ---]socket.gaierror: [Errno -2] No address found
    [--- st2api ---]
    [--- st2api ---]During handling of the above exception, another exception occurred:
    [--- st2api ---]
    [--- st2api ---]Traceback (most recent call last):
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/utils/functional.py", line 343, in retry_over_time
    [--- st2api ---]    return fun(*args, **kwargs)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 283, in connect
    [--- st2api ---]    return self.connection
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 837, in connection
    [--- st2api ---]    self._connection = self._establish_connection()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 792, in _establish_connection
    [--- st2api ---]    conn = self.transport.establish_connection()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    [--- st2api ---]    conn.connect()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 311, in connect
    [--- st2api ---]    self.transport.connect()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 77, in connect
    [--- st2api ---]    self._connect(self.host, self.port, self.connect_timeout)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 148, in _connect
    [--- st2api ---]    "failed to resolve broker hostname"))
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 161, in _connect
    [--- st2api ---]    self.sock.connect(sa)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 263, in connect
    [--- st2api ---]    self._trampoline(fd, write=True, timeout=timeout, timeout_exc=_timeout_exc)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 210, in _trampoline
    [--- st2api ---]    mark_as_closed=self._mark_as_closed)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
    [--- st2api ---]    return hub.switch()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    [--- st2api ---]    return self.greenlet.switch()
    [--- st2api ---]socket.timeout: timed out
    [--- st2api ---]2020-05-23 20:17:36,180 WARNING [-] Broker connection error, trying again in 2.0 seconds: timeout('timed out',).
    [--- st2api ---]Traceback (most recent call last):
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 137, in _connect
    [--- st2api ---]    host, port, family, socket.SOCK_STREAM, SOL_TCP)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/support/greendns.py", line 537, in getaddrinfo
    [--- st2api ---]    raise socket.gaierror(socket.EAI_NONAME, 'No address found')
    [--- st2api ---]socket.gaierror: [Errno -2] No address found
    [--- st2api ---]
    [--- st2api ---]During handling of the above exception, another exception occurred:
    [--- st2api ---]
    [--- st2api ---]Traceback (most recent call last):
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/utils/functional.py", line 343, in retry_over_time
    [--- st2api ---]    return fun(*args, **kwargs)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 283, in connect
    [--- st2api ---]    return self.connection
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 837, in connection
    [--- st2api ---]    self._connection = self._establish_connection()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/connection.py", line 792, in _establish_connection
    [--- st2api ---]    conn = self.transport.establish_connection()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/transport/pyamqp.py", line 130, in establish_connection
    [--- st2api ---]    conn.connect()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 311, in connect
    [--- st2api ---]    self.transport.connect()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 77, in connect
    [--- st2api ---]    self._connect(self.host, self.port, self.connect_timeout)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 148, in _connect
    [--- st2api ---]    "failed to resolve broker hostname"))
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/transport.py", line 161, in _connect
    [--- st2api ---]    self.sock.connect(sa)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 263, in connect
    [--- st2api ---]    self._trampoline(fd, write=True, timeout=timeout, timeout_exc=_timeout_exc)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenio/base.py", line 210, in _trampoline
    [--- st2api ---]    mark_as_closed=self._mark_as_closed)
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/__init__.py", line 159, in trampoline
    [--- st2api ---]    return hub.switch()
    [--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    [--- st2api ---]    return self.greenlet.switch()
    [--- st2api ---]socket.timeout: timed out
  4. Revert 3 and re-establish the connection between the st2 services and RabbitMQ again: sudo iptables -D FORWARD -d 172.23.0.3 -j DROP

Expected Results

St2 services should work as intended as soon as they are able to connect to RabbitMQ again

Actual Results

The St2 services are able to establish the connection to the RabbitMQ service but can't serve any requests as the resource is locked from the previous lock which is orphaned now:

The appearing error:

[--- st2api ---]2020-05-23 20:21:38,654 INFO [-] Connected to amqp://guest:**@rabbitmq:5672//
[--- st2api ---]2020-05-23 20:21:38,665 INFO [-] Connected to amqp://guest:**@rabbitmq:5672//
[--- st2api ---]2020-05-23 20:21:38,665 WARNING [-] Connection to broker lost, trying to re-establish connection...
[--- st2api ---]Traceback (most recent call last):
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 175, in run
[--- st2api ---]    for _ in self.consume(limit=None, **kwargs):
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 191, in consume
[--- st2api ---]    with self.consumer_context(**kwargs) as (conn, channel, consumers):
[--- st2api ---]  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
[--- st2api ---]    return next(self.gen)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 184, in consumer_context
[--- st2api ---]    with self.Consumer() as (connection, channel, consumers):
[--- st2api ---]  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
[--- st2api ---]    return next(self.gen)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/mixins.py", line 233, in Consumer
[--- st2api ---]    with self._consume_from(*self.get_consumers(cls, channel)) as c:
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/triggerwatcher.py", line 77, in get_consumers
[--- st2api ---]    callbacks=[self.process_task])]
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 386, in __init__
[--- st2api ---]    self.revive(self.channel)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 408, in revive
[--- st2api ---]    self.declare()
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/messaging.py", line 421, in declare
[--- st2api ---]    queue.declare()
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 609, in declare
[--- st2api ---]    self._create_queue(nowait=nowait, channel=channel)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 618, in _create_queue
[--- st2api ---]    self.queue_declare(nowait=nowait, passive=False, channel=channel)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/kombu/entity.py", line 653, in queue_declare
[--- st2api ---]    nowait=nowait,
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/channel.py", line 1149, in queue_declare
[--- st2api ---]    spec.Queue.DeclareOk, returns_tuple=True,
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 88, in wait
[--- st2api ---]    self.connection.drain_events(timeout=timeout)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 505, in drain_events
[--- st2api ---]    while not self.blocking_read(timeout):
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 511, in blocking_read
[--- st2api ---]    return self.on_inbound_frame(frame)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/method_framing.py", line 55, in on_frame
[--- st2api ---]    callback(channel, method_sig, buf, None)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/connection.py", line 518, in on_inbound_method
[--- st2api ---]    method_sig, payload, content,
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
[--- st2api ---]    listener(*args)
[--- st2api ---]  File "/opt/stackstorm/st2/lib/python3.6/site-packages/amqp/channel.py", line 280, in _on_close
[--- st2api ---]    reply_code, reply_text, (class_id, method_id), ChannelError,
[--- st2api ---]amqp.exceptions.ResourceLocked: Queue.declare: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.TimersController-4d50c0c7ad' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original dec...

I was able to reproduce this for different services and assume that the RabbitMQ client implementation is the same for all components. So it's likely that all services communicating with the broker are affected. Services that are affected for sure: st2api st2auth st2sensorcontainer

Although this issue will be fixed with a restart of the affected st2 service it would be nice to have st2 handle issues like this gracefully so that we minimize the risk for unnoticed issues like this to appear and cause potential impact.

magiceses commented 3 years ago

i have similar problem.

` 2020-11-20T04:05:14.520187575+08:00 stdout F 2020-11-20 04:05:14,519 WARNING [-] Connectionto broker lost, trying to re-establish connection... 2020-11-20T04:05:14.520222755+08:00 stdout F Traceback (most recent call last): 2020-11-20T04:05:14.520233856+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 170, in run 2020-11-20T04:05:14.520242396+08:00 stdout F for _ in self.consume(limit=None, kwargs): 2020-11-20T04:05:14.520252136+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 192, in consume 2020-11-20T04:05:14.520260296+08:00 stdout F conn.drain_events(timeout=safety_interval) 2020-11-20T04:05:14.520269176+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events 2020-11-20T04:05:14.520277516+08:00 stdout F return self.transport.drain_events(self.connection, kwargs) 2020-11-20T04:05:14.520415397+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events 2020-11-20T04:05:14.520433617+08:00 stdout F return connection.drain_events(**kwargs) 2020-11-20T04:05:14.520442137+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 505, in drain_events 2020-11-20T04:05:14.520449837+08:00 stdout F while not self.blocking_read(timeout): 2020-11-20T04:05:14.520459397+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 510, in blocking_read 2020-11-20T04:05:14.520466897+08:00 stdout F frame = self.transport.read_frame() 2020-11-20T04:05:14.520474377+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/transport.py", line 252, in read_frame 2020-11-20T04:05:14.520481898+08:00 stdout F frame_header = read(7, True) 2020-11-20T04:05:14.520489638+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/transport.py", line 446, in _read 2020-11-20T04:05:14.520497018+08:00 stdout F raise IOError('Server unexpectedly closed connection') 2020-11-20T04:05:14.520504698+08:00 stdout F IOError: Server unexpectedly closed connection

2020-11-20T04:05:14.544874708+08:00 stdout F 2020-11-20 04:05:14,544 INFO [-] Connected to amqp://rabbitmq:@rabbitmq.openstack.svc.cluster.local:5672// 2020-11-20T04:05:14.5464913+08:00 stdout F 2020-11-20 04:05:14,546 INFO [-] Connected to amqp://rabbitmq:@rabbitmq.openstack.svc.cluster.local:5672// 2020-11-20T04:05:14.889177071+08:00 stdout F 2020-11-20 04:05:14,555 WARNING [-] Connectionto broker lost, trying to re-establish connection... 2020-11-20T04:05:14.889231732+08:00 stdout F Traceback (most recent call last): 2020-11-20T04:05:14.889246212+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 170, in run 2020-11-20T04:05:14.889255772+08:00 stdout F for _ in self.consume(limit=None, kwargs): 2020-11-20T04:05:14.889265432+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 186, in consume 2020-11-20T04:05:14.889273732+08:00 stdout F with self.consumer_context(kwargs) as (conn, channel, consumers): 2020-11-20T04:05:14.889302792+08:00 stdout F File "/usr/lib64/python2.7/contextlib.py", line 17, in enter 2020-11-20T04:05:14.889311612+08:00 stdout F return self.gen.next() 2020-11-20T04:05:14.889319612+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 179, in consumer_context 2020-11-20T04:05:14.889327672+08:00 stdout F with self.Consumer() as (connection, channel, consumers): 2020-11-20T04:05:14.889335192+08:00 stdout F File "/usr/lib64/python2.7/contextlib.py", line 17, in enter 2020-11-20T04:05:14.889342852+08:00 stdout F return self.gen.next() 2020-11-20T04:05:14.889352813+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 228, in Consumer 2020-11-20T04:05:14.889361033+08:00 stdout F with self._consume_from(self.get_consumers(cls, channel)) as c: 2020-11-20T04:05:14.889368953+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/services/triggerwatcher.py", line 76, in get_consumers 2020-11-20T04:05:14.889380093+08:00 stdout F callbacks=[self.process_task])] 2020-11-20T04:05:14.889387873+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 386, in init 2020-11-20T04:05:14.889396293+08:00 stdout F self.revive( 2020-11-20T04:05:14.889404333+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 408, in revive 2020-11-20T04:05:14.889412293+08:00 stdout F self.declare() 2020-11-20T04:05:14.889421733+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 421, in declare 2020-11-20T04:05:14.889429253+08:00 stdout F queue.declare() 2020-11-20T04:05:14.889436713+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare 2020-11-20T04:05:14.889444213+08:00 stdout F self._create_queue(nowait=nowait, channel=channel) 2020-11-20T04:05:14.889453773+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue 2020-11-20T04:05:14.889462653+08:00 stdout F self.queue_declare(nowait=nowait, passive=False, channel=channel) 2020-11-20T04:05:14.889471053+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare 2020-11-20T04:05:14.889480694+08:00 stdout F nowait=nowait, 2020-11-20T04:05:14.889488674+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/channel.py", line 1149, in queue_declare 2020-11-20T04:05:14.889496534+08:00 stdout F spec.Queue.DeclareOk, returns_tuple=True, 2020-11-20T04:05:14.889504074+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/abstract_channel.py", line 88, in wait 2020-11-20T04:05:14.889511874+08:00 stdout F self.connection.drain_events(timeout=timeout) 2020-11-20T04:05:14.889518914+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 505, in drain_events 2020-11-20T04:05:14.889526394+08:00 stdout F while not self.blocking_read(timeout): 2020-11-20T04:05:14.889534174+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 511, in blocking_read 2020-11-20T04:05:14.889541534+08:00 stdout F return self.on_inbound_frame(frame) 2020-11-20T04:05:14.889548834+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/method_framing.py", line 55, in on_frame 2020-11-20T04:05:14.889556294+08:00 stdout F callback(channel, method_sig, buf, None) 2020-11-20T04:05:14.889563774+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 518, in on_inbound_method 2020-11-20T04:05:14.889571294+08:00 stdout F method_sig, payload, content, 2020-11-20T04:05:14.889578314+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method 2020-11-20T04:05:14.889586154+08:00 stdout F listener(args) 2020-11-20T04:05:14.889593834+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/channel.py", line 280, in _on_close 2020-11-20T04:05:14.889602114+08:00 stdout F reply_code, reply_text, (class_id, method_id), ChannelError, 2020-11-20T04:05:14.889610055+08:00 stdout F ResourceLocked: Queue.declare: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.TimersController-e1c29b5f8d' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original dec...

`