F5Networks / f5-openstack-lbaasv2-driver

F5 LBaaSv2 service provider driver for OpenStack Liberty and beyond
http://clouddocs.f5.com/products/openstack/lbaasv2-driver/master/
Apache License 2.0

RPC Unexpected Frame error can occur with driver and agent code running #474

Closed: pjbreaux closed this issue 7 years ago

pjbreaux commented 7 years ago

Agent Version

Have seen it on agent releases 9.2.0 and 9.3.0.b2

Operating System

CentOS 7

OpenStack Release

both liberty and mitaka encounter this issue

Bug Severity

1

Severity level definitions:

  1. Severity 1 (Critical): Defect is causing systems to be offline and/or nonfunctional. Immediate attention is required.
  2. Severity 2 (High): Defect is causing major obstruction of system operations.
  3. Severity 3 (Medium): Defect is causing intermittent errors in system operations.
  4. Severity 4 (Low): Defect is causing infrequent interruptions in system operations.
  5. Severity 5 (Trivial): Defect is not causing any interruptions to system operations, but is nonetheless a bug.

Description

When we run tempest tests nightly, a test may occasionally hang forever. If verbosity is turned up and pytest is printing to stdout, you may see the last few calls as follows:

FAILED2017-03-09 15:54:48.113 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/d7225337-b497-4f02-b137-6a2d857d9fe3/statuses 0.015s
2017-03-09 15:54:48.128 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/589f4dd6-e573-4506-a8e6-2b755bae1c00/statuses 0.014s
2017-03-09 15:54:48.142 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/03f99da1-42e7-48dd-ac22-ccb83a95c57c/statuses 0.013s
2017-03-09 15:54:48.155 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/e63fcf05-9a49-46b6-9ac4-dd9919296abc/statuses 0.013s
2017-03-09 15:54:48.171 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/533be03c-c387-4a69-b16c-d18fa742b5c6/statuses 0.015s
2017-03-09 15:54:48.187 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/11e103cf-c74f-417e-9561-2a3d475344d8/statuses 0.015s
2017-03-09 15:54:48.202 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/7ea5a29f-bf81-4973-8b3b-2cb61a2da2d4/statuses 0.014s
2017-03-09 15:54:48.215 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b7a085bc-ed23-4cc0-a8ef-49fec513e313/statuses 0.013s
2017-03-09 15:54:48.427 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 200 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243/statuses 0.211s
2017-03-09 15:54:49.137 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 204 DELETE http://10.190.4.135:9696/v2.0/lbaas/pools/33249857-8853-4c7d-8836-125e0c8b717b 0.709s
2017-03-09 15:54:49.313 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 200 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.175s
2017-03-09 15:54:50.427 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 200 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.111s
2017-03-09 15:54:51.131 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 204 DELETE http://10.190.4.135:9696/v2.0/lbaas/listeners/bb307b73-c9a1-4bf7-a8e6-f96ca0713610 0.703s
2017-03-09 15:54:51.222 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 200 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.090s
2017-03-09 15:54:52.438 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 200 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.214s
2017-03-09 15:54:52.876 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 204 DELETE http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.437s
2017-03-09 15:54:52.890 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/lbaas/loadbalancers/b5d21f67-860d-4400-a8b4-f28bf2fd7243 0.014s
2017-03-09 15:54:53.017 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 404 GET http://10.190.4.135:9696/v2.0/ports/f17b0ecb-6a9c-45b5-91ab-9c689f0f93fc 0.126s
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
2017-03-09 15:54:53.545 7052 INFO tempest.lib.common.rest_client [req-7e2c609c-3ae2-49bf-818f-91568a49bae0 - - - - -] Request (ServiceBuilderTestJSON:tearDownClass): 204 DELETE http://10.190.4.135:9696/v2.0/subnets/374c6c68-fb54-4d25-8c63-31af60cf7bb0 0.509s

As far as we know, the test never recovers, nor does it time out. Even if it did time out, all succeeding tests would also fail, because at that point RPC communication between the driver and the agent is effectively down. Neutron commands (such as creating a port) still work, but neutron-lbaas write commands, such as create/update, hang (read commands still work).

The only way to remedy the issue is to restart neutron-server, which often takes some time because it waits for the broken queues to time out. Here is the exception shown in the neutron server log:

2017-02-19 16:15:44.361 23680 DEBUG neutron.callbacks.manager [req-7d57011b-5bc8-4497-b832-e8e22f029014 0e24d90e9ce8443cb2bb83df3174a6e7 b3ad94b5eef5485494bee4c71fa40b25 - - -] Notify callbacks for port, after_create _notify_loop /usr/lib/python2.7/site-packages/neutron/callbacks/manager.py:138
2017-02-19 16:15:44.362 23680 DEBUG neutron.callbacks.manager [req-7d57011b-5bc8-4497-b832-e8e22f029014 0e24d90e9ce8443cb2bb83df3174a6e7 b3ad94b5eef5485494bee4c71fa40b25 - - -] Calling callback neutron.db.l3_dvrscheduler_db._notify_l3_agent_new_port _notify_loop /usr/lib/python2.7/site-packages/neutron/callbacks/manager.py:145
2017-02-19 16:15:44.362 23680 DEBUG neutron.db.l3_dvrscheduler_db [req-7d57011b-5bc8-4497-b832-e8e22f029014 0e24d90e9ce8443cb2bb83df3174a6e7 b3ad94b5eef5485494bee4c71fa40b25 - - -] Received port after_create _notify_l3_agent_new_port /usr/lib/python2.7/site-packages/neutron/db/l3_dvrscheduler_db.py:345
2017-02-19 16:15:44.386 23680 DEBUG oslo_messaging._drivers.amqpdriver [req-7d57011b-5bc8-4497-b832-e8e22f029014 0e24d90e9ce8443cb2bb83df3174a6e7 b3ad94b5eef5485494bee4c71fa40b25 - - -] CAST unique_id: ec0cb81af65843ce99fec7c3a47b1910 FANOUT topic 'q-agent-notifier-security_group-update' _send /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:443
2017-02-19 16:15:44.674 23690 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 39a49b5d40a34eebb7e6014a571a5f22 reply to reply_0b6f3b4782a74889a5a0e2957dc68246 __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:201
2017-02-19 16:15:44.683 23690 DEBUG oslo_messaging._drivers.amqpdriver [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] sending reply msg_id: 39a49b5d40a34eebb7e6014a571a5f22 reply queue: reply_0b6f3b4782a74889a5a0e2957dc68246 time elapsed: 0.00769564600023s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:74
2017-02-19 16:15:51.148 23692 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: 276c3988a3474754b587d05a13f6753b reply to reply_bb387cacc01e49d3bb92fa886e6c053b __call__ /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:201
2017-02-19 16:15:51.156 23692 DEBUG oslo_messaging._drivers.amqpdriver [req-a42b6eae-9b41-42ff-b577-af055ca259c3 - - - - -] sending reply msg_id: 276c3988a3474754b587d05a13f6753b reply queue: reply_bb387cacc01e49d3bb92fa886e6c053b time elapsed: 0.00783046599963s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:74
2017-02-19 16:15:51.158 23690 ERROR oslo.messaging._drivers.impl_rabbit [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] Failed to publish message to topic 'reply_0b6f3b4782a74889a5a0e2957dc68246': Received 0x01 while expecting 0xce
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] Exception during message handling: Received 0x01 while expecting 0xce
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     incoming.message))
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 100, in reply
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     log_failure=log_failure)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 75, in _send_reply
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     conn.direct_send(self.reply_q, rpc_common.serialize_msg(msg))
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1187, in direct_send
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     exchange, msg, routing_key=msg_id)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1067, in _ensure_publishing
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     self.ensure(method, retry=retry, error_callback=_error_callback)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 747, in ensure
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     ret, channel = autoretry_method()
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     return fun(*args, **kwargs)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     return fun(*args, channel=channels[0], **kwargs), channels[0]
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 723, in execute_method
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     method()
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1162, in _publish_and_raises_on_missing_exchange
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     timeout=timeout)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1108, in _publish
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     compression=self.kombu_compression)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     routing_key, mandatory, immediate, exchange, declare)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 188, in _publish
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     mandatory=mandatory, immediate=immediate,
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 2130, in basic_publish_confirm
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     self.wait([(60, 80)])
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     self.channel_id, allowed_methods)
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     self.method_reader.read_method()
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher     raise m
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher UnexpectedFrame: Received 0x01 while expecting 0xce
2017-02-19 16:15:51.159 23690 ERROR oslo_messaging.rpc.dispatcher 
2017-02-19 16:15:51.163 23690 ERROR oslo_messaging._drivers.common [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] Returning exception Received 0x01 while expecting 0xce to caller
2017-02-19 16:15:51.163 23690 ERROR oslo_messaging._drivers.common [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply\n    incoming.message))\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 100, in reply\n    log_failure=log_failure)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 75, in _send_reply\n    conn.direct_send(self.reply_q, rpc_common.serialize_msg(msg))\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1187, in direct_send\n    exchange, msg, routing_key=msg_id)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1067, in _ensure_publishing\n    self.ensure(method, retry=retry, error_callback=_error_callback)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 747, in ensure\n    ret, channel = autoretry_method()\n', '  File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 436, in _ensured\n    return fun(*args, **kwargs)\n', '  File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 508, in __call__\n    return fun(*args, channel=channels[0], **kwargs), channels[0]\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 723, in execute_method\n    method()\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1162, in _publish_and_raises_on_missing_exchange\n    timeout=timeout)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1108, in _publish\n    compression=self.kombu_compression)\n', '  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 172, in publish\n    routing_key, mandatory, immediate, exchange, declare)\n', '  File "/usr/lib/python2.7/site-packages/kombu/messaging.py", line 188, in _publish\n    mandatory=mandatory, immediate=immediate,\n', '  File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 2130, in basic_publish_confirm\n    self.wait([(60, 80)])\n', '  File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 67, in wait\n    self.channel_id, allowed_methods)\n', '  File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 240, in _wait_method\n    self.method_reader.read_method()\n', '  File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 189, in read_method\n    raise m\n', 'UnexpectedFrame: Received 0x01 while expecting 0xce\n']
2017-02-19 16:15:51.164 23690 DEBUG oslo_messaging._drivers.amqpdriver [req-2338d345-bd74-40f5-8cd3-48a40eb25bc1 - - - - -] sending reply msg_id: 39a49b5d40a34eebb7e6014a571a5f22 reply queue: reply_0b6f3b4782a74889a5a0e2957dc68246 time elapsed: 6.488788711s _send_reply /usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py:74

There are a few flavors of the exception message, mostly variations such as 'Received 0x00' or 'Received 0x04'. Sometimes this exception alone is not enough to break the RPC communication channels between the agent and the driver, but it often does.
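
For background on what the byte values mean (this is general AMQP 0-9-1 framing, not anything specific to this driver): every frame ends with the sentinel octet 0xCE, and 0x01 is the frame-type byte of a METHOD frame. "Received 0x01 while expecting 0xce" therefore means the client found the start of another frame where the previous frame's end marker should have been, i.e. the byte stream is out of sync, which is what happens when two writers interleave frames on a single connection. A minimal parsing sketch showing where the message comes from (a hypothetical helper, not the amqp library's actual code):

```python
import io
import struct

FRAME_END = 0xCE     # AMQP 0-9-1 frame terminator
FRAME_METHOD = 0x01  # frame-type byte of a METHOD frame

def read_frame(stream):
    """Read one frame: type(1) channel(2) size(4) payload(size) end(1)."""
    frame_type, channel, size = struct.unpack('>BHI', stream.read(7))
    payload = stream.read(size)
    end = stream.read(1)[0]
    if end != FRAME_END:
        # This is the condition behind "Received 0x.. while expecting 0xce".
        raise ValueError('Received 0x%02x while expecting 0x%02x'
                         % (end, FRAME_END))
    return frame_type, channel, payload

# A well-formed frame parses cleanly...
good = struct.pack('>BHI', FRAME_METHOD, 0, 3) + b'abc' + bytes([FRAME_END])
print(read_frame(io.BytesIO(good)))

# ...but if another writer's frame starts where the end octet should be, the
# parser reads 0x01 (a METHOD frame type) instead of 0xCE and raises.
bad = struct.pack('>BHI', FRAME_METHOD, 0, 3) + b'abc' + bytes([FRAME_METHOD])
read_frame(io.BytesIO(bad))
```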

The issue seems to happen regardless of BIG-IP version, but we have not yet determined whether it is just as common in an undercloud scenario as in an overcloud one. In nightly tests we primarily see the issue in overcloud, but that is likely because the overcloud tests run first.

Also note that the topic that is called out in the exception is a reply topic.

pjbreaux commented 7 years ago

The issue is fairly difficult to reproduce when running tests manually, although certain tests seem to produce the problem more than others. One such test is here:

https://github.com/F5Networks/f5-openstack-lbaasv2-driver/blob/mitaka/f5lbaasdriver/test/tempest/tests/api/test_load_balancers_admin.py#L75

It is a curious test because it creates two loadbalancers, although that was likely not the original intention. Regardless, the failure often occurs around VIP port teardown.

pjbreaux commented 7 years ago

Another thing to note, and this one is very important: we've seen this issue arise simply after installing the agent and driver code and restarting neutron-server and the f5-openstack-agent. No tests were running and no tests had been run against the stack. This crops up in nightly runs from time to time, and manifests as the very first tempest test we run (usually an L7 API test) stalling.

pjbreaux commented 7 years ago

We have tried a variety of things to mitigate or work around this issue. One major change was updating neutron to the 8.3.0 release; the RPC error happened in that deployment as well.

pjbreaux commented 7 years ago

Occasionally, we see these types of messages in the log:

https://github.com/openstack/oslo.messaging/blob/8d42484e6f153c57b1216a7642841e7e6a77e887/oslo_messaging/_drivers/impl_rabbit.py#L748

This implies that a new oslo_messaging process, or at least a new instance of the rabbit driver, has been forked. As of now, we are unsure how a new process is being forked.
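
For what it's worth, that warning is the kind produced by a PID check: the connection records the process that created it and complains when a different process reuses it, because AMQP connections are not safe to share across a fork. A rough sketch of the pattern (an illustration under that assumption, not oslo.messaging's actual code):

```python
import logging
import os

LOG = logging.getLogger(__name__)

class ForkAwareConnection(object):
    """Records the creating PID and warns if another process reuses it."""

    def __init__(self):
        self._created_in_pid = os.getpid()

    def ensure_same_process(self):
        current = os.getpid()
        if current != self._created_in_pid:
            LOG.warning('Connection created in pid %s is being used in pid %s; '
                        'AMQP connections should not be shared across forks.',
                        self._created_in_pid, current)
```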

pjbreaux commented 7 years ago

We encountered the problem as well when running against driver commit 7d736d162cca30eb000e0b2a32266dde9ef7ebdc and agent commit 15c7bc39d32a2bf7c7485e60c15bbabcd75bded1. Those commits date to early December.

zancas commented 7 years ago

The new fork warning is interesting. Is it possible that we're spawning new oslo_messaging processes? That would probably be in the agent, not the driver, and would likely be a bug.
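
To make the speculation concrete, here is a hypothetical sketch (not code from this repo) of the failure mode: an AMQP connection is opened, the process forks, and both parent and child then publish over the same socket. Large bodies are split across multiple frames, so the two writers can interleave partial frames and corrupt the stream, which can surface as exactly these UnexpectedFrame errors. It assumes a local RabbitMQ broker at amqp://guest:guest@localhost:5672// and is a sketch of the race, not a guaranteed reproducer:

```python
import os

import kombu

# Open the connection (and its underlying socket) before forking.
conn = kombu.Connection('amqp://guest:guest@localhost:5672//')
conn.connect()
producer = conn.Producer()

body = b'x' * (1024 * 1024)  # large enough to be split across several frames
pid = os.fork()
who = 'child' if pid == 0 else 'parent'
print('%s publishing on the shared connection' % who)

for i in range(200):
    # Parent and child both write frames onto the same socket with no
    # coordination, so partial frames from each side can interleave.
    producer.publish(body, routing_key='unexpected_frame_demo',
                     content_type='application/octet-stream')
```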

zancas commented 7 years ago

@pjbreaux under what regime (test conditions) do we see the warning message about an oslo_messaging instance with a mismatched PID?

pjbreaux commented 7 years ago

The script I run that triggers this issue most often is systest/scripts/run_neutron_lbaas.sh. Running it in a loop should reproduce the issue within a reasonable amount of time (such as 8 hours); a sketch of such a loop follows.
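
A minimal wrapper for that loop (hypothetical, not part of the repo; the script path is taken from the comment above and the per-run timeout is an assumption):

```python
import subprocess

SCRIPT = 'systest/scripts/run_neutron_lbaas.sh'  # run from the repo root
PER_RUN_TIMEOUT = 3600                           # seconds; adjust as needed

run = 0
while True:
    run += 1
    print('--- run %d ---' % run)
    try:
        result = subprocess.run(['bash', SCRIPT], timeout=PER_RUN_TIMEOUT)
    except subprocess.TimeoutExpired:
        # A run that never finishes is the "test hangs forever" symptom.
        print('run %d hung past %ds; likely hit the RPC failure'
              % (run, PER_RUN_TIMEOUT))
        break
    if result.returncode != 0:
        print('run %d failed with exit code %d' % (run, result.returncode))
        break
```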

swormke commented 7 years ago

Adding this link for reference, as we have seen the frame_too_large problem it describes: http://john.eckersberg.com/debugging-rabbitmq-frame_too_large-error.html