Open winem opened 4 years ago
i have similar problem.
` 2020-11-20T04:05:14.520187575+08:00 stdout F 2020-11-20 04:05:14,519 WARNING [-] Connectionto broker lost, trying to re-establish connection... 2020-11-20T04:05:14.520222755+08:00 stdout F Traceback (most recent call last): 2020-11-20T04:05:14.520233856+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 170, in run 2020-11-20T04:05:14.520242396+08:00 stdout F for _ in self.consume(limit=None, kwargs): 2020-11-20T04:05:14.520252136+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 192, in consume 2020-11-20T04:05:14.520260296+08:00 stdout F conn.drain_events(timeout=safety_interval) 2020-11-20T04:05:14.520269176+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events 2020-11-20T04:05:14.520277516+08:00 stdout F return self.transport.drain_events(self.connection, kwargs) 2020-11-20T04:05:14.520415397+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events 2020-11-20T04:05:14.520433617+08:00 stdout F return connection.drain_events(**kwargs) 2020-11-20T04:05:14.520442137+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 505, in drain_events 2020-11-20T04:05:14.520449837+08:00 stdout F while not self.blocking_read(timeout): 2020-11-20T04:05:14.520459397+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 510, in blocking_read 2020-11-20T04:05:14.520466897+08:00 stdout F frame = self.transport.read_frame() 2020-11-20T04:05:14.520474377+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/transport.py", line 252, in read_frame 2020-11-20T04:05:14.520481898+08:00 stdout F frame_header = read(7, True) 2020-11-20T04:05:14.520489638+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/transport.py", line 446, in _read 2020-11-20T04:05:14.520497018+08:00 stdout F raise IOError('Server unexpectedly closed connection') 2020-11-20T04:05:14.520504698+08:00 stdout F IOError: Server unexpectedly closed connection
2020-11-20T04:05:14.544874708+08:00 stdout F 2020-11-20 04:05:14,544 INFO [-] Connected to amqp://rabbitmq:@rabbitmq.openstack.svc.cluster.local:5672// 2020-11-20T04:05:14.5464913+08:00 stdout F 2020-11-20 04:05:14,546 INFO [-] Connected to amqp://rabbitmq:@rabbitmq.openstack.svc.cluster.local:5672// 2020-11-20T04:05:14.889177071+08:00 stdout F 2020-11-20 04:05:14,555 WARNING [-] Connectionto broker lost, trying to re-establish connection... 2020-11-20T04:05:14.889231732+08:00 stdout F Traceback (most recent call last): 2020-11-20T04:05:14.889246212+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 170, in run 2020-11-20T04:05:14.889255772+08:00 stdout F for _ in self.consume(limit=None, kwargs): 2020-11-20T04:05:14.889265432+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 186, in consume 2020-11-20T04:05:14.889273732+08:00 stdout F with self.consumer_context(kwargs) as (conn, channel, consumers): 2020-11-20T04:05:14.889302792+08:00 stdout F File "/usr/lib64/python2.7/contextlib.py", line 17, in enter 2020-11-20T04:05:14.889311612+08:00 stdout F return self.gen.next() 2020-11-20T04:05:14.889319612+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 179, in consumer_context 2020-11-20T04:05:14.889327672+08:00 stdout F with self.Consumer() as (connection, channel, consumers): 2020-11-20T04:05:14.889335192+08:00 stdout F File "/usr/lib64/python2.7/contextlib.py", line 17, in enter 2020-11-20T04:05:14.889342852+08:00 stdout F return self.gen.next() 2020-11-20T04:05:14.889352813+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/mixins.py", line 228, in Consumer 2020-11-20T04:05:14.889361033+08:00 stdout F with self._consume_from(self.get_consumers(cls, channel)) as c: 2020-11-20T04:05:14.889368953+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/services/triggerwatcher.py", line 76, in get_consumers 2020-11-20T04:05:14.889380093+08:00 stdout F callbacks=[self.process_task])] 2020-11-20T04:05:14.889387873+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 386, in init 2020-11-20T04:05:14.889396293+08:00 stdout F self.revive( 2020-11-20T04:05:14.889404333+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 408, in revive 2020-11-20T04:05:14.889412293+08:00 stdout F self.declare() 2020-11-20T04:05:14.889421733+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/messaging.py", line 421, in declare 2020-11-20T04:05:14.889429253+08:00 stdout F queue.declare() 2020-11-20T04:05:14.889436713+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 605, in declare 2020-11-20T04:05:14.889444213+08:00 stdout F self._create_queue(nowait=nowait, channel=channel) 2020-11-20T04:05:14.889453773+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 614, in _create_queue 2020-11-20T04:05:14.889462653+08:00 stdout F self.queue_declare(nowait=nowait, passive=False, channel=channel) 2020-11-20T04:05:14.889471053+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/kombu/entity.py", line 649, in queue_declare 2020-11-20T04:05:14.889480694+08:00 stdout F nowait=nowait, 2020-11-20T04:05:14.889488674+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/channel.py", line 1149, in queue_declare 2020-11-20T04:05:14.889496534+08:00 stdout F spec.Queue.DeclareOk, returns_tuple=True, 2020-11-20T04:05:14.889504074+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/abstract_channel.py", line 88, in wait 2020-11-20T04:05:14.889511874+08:00 stdout F self.connection.drain_events(timeout=timeout) 2020-11-20T04:05:14.889518914+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 505, in drain_events 2020-11-20T04:05:14.889526394+08:00 stdout F while not self.blocking_read(timeout): 2020-11-20T04:05:14.889534174+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 511, in blocking_read 2020-11-20T04:05:14.889541534+08:00 stdout F return self.on_inbound_frame(frame) 2020-11-20T04:05:14.889548834+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/method_framing.py", line 55, in on_frame 2020-11-20T04:05:14.889556294+08:00 stdout F callback(channel, method_sig, buf, None) 2020-11-20T04:05:14.889563774+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/connection.py", line 518, in on_inbound_method 2020-11-20T04:05:14.889571294+08:00 stdout F method_sig, payload, content, 2020-11-20T04:05:14.889578314+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method 2020-11-20T04:05:14.889586154+08:00 stdout F listener(args) 2020-11-20T04:05:14.889593834+08:00 stdout F File "/opt/stackstorm/st2/lib/python2.7/site-packages/amqp/channel.py", line 280, in _on_close 2020-11-20T04:05:14.889602114+08:00 stdout F reply_code, reply_text, (class_id, method_id), ChannelError, 2020-11-20T04:05:14.889610055+08:00 stdout F ResourceLocked: Queue.declare: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.TimersController-e1c29b5f8d' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original dec...
`
SUMMARY
St2 services do not fully recover from network partitions or other network issues where packets sent to RabbitMQ were dropped.
STACKSTORM VERSION
st2 3.3dev (9ea417346), on Python 3.6.9
Earlier versions are affected as well.
OS, environment, install method
I came across this issue when doing some tests for https://github.com/StackStorm/st2-dockerfiles/pull/28 and was able to reproduce it on plain Ubuntu VMs as well.
Steps to reproduce the problem
sudo iptables -I FORWARD 1 -d 172.23.0.3 -j DROP
sudo iptables -D FORWARD -d 172.23.0.3 -j DROP
Expected Results
St2 services should work as intended as soon as they are able to connect to RabbitMQ again
Actual Results
The St2 services are able to establish the connection to the RabbitMQ service but can't serve any requests as the resource is locked from the previous lock which is orphaned now:
The appearing error:
I was able to reproduce this for different services and assume that the RabbitMQ client implementation is the same for all components. So it's likely that all services communicating with the broker are affected. Services that are affected for sure: st2api st2auth st2sensorcontainer
Although this issue will be fixed with a restart of the affected st2 service it would be nice to have st2 handle issues like this gracefully so that we minimize the risk for unnoticed issues like this to appear and cause potential impact.