Closed: pjbreaux closed this issue 7 years ago.
The issue is fairly difficult to reproduce when running tests manually, although certain tests seem to produce the problem more than others. One such test is here:
It is a curious test because it creates two load balancers, although that was probably not the original intent. Regardless, the failure often occurs around VIP port teardown.
Other things to note, and this one is very important: we've seen this issue arise simply after installing the agent and driver code and restarting the neutron server and the f5-openstack-agent. No tests were running, and no tests had been run against the stack. This crops up in nightly runs from time to time, and it manifests as the very first tempest test we run (usually an L7 API test) stalling.
We have tried a variety of things to mitigate or work around this issue. One major change was updating neutron to the 8.3.0 release; the RPC error happened in that deployment as well.
Occasionally, we see these types of messages in the log:
This implies that a new oslo_messaging process, or at least a new instance of the rabbit driver, has been forked. As of now, we are unsure how a new process would get forked.
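For context, this kind of warning generally comes from a guard that records the PID when the connection is created and compares it on each use. A minimal illustrative sketch of that pattern (not oslo.messaging's actual code; the class and names here are made up):

```python
import logging
import os

LOG = logging.getLogger(__name__)


class ForkAwareConnection(object):
    """Detect fork-after-connect by remembering the creating PID."""

    def __init__(self, connect):
        self._connect = connect        # callable that opens a real AMQP connection
        self._connection = connect()
        self._created_pid = os.getpid()

    def acquire(self):
        if os.getpid() != self._created_pid:
            # The process was forked after the connection was established,
            # so the underlying socket is shared with the parent. Using it
            # from both processes interleaves frames on one TCP stream.
            LOG.warning("Connection created in pid %s but used in pid %s; "
                        "reconnecting.", self._created_pid, os.getpid())
            self._connection = self._connect()
            self._created_pid = os.getpid()
        return self._connection
```

If something in the agent or driver forks after the transport is set up, a check like this is what would fire.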
When running against driver commit 7d736d162cca30eb000e0b2a32266dde9ef7ebdc and agent commit 15c7bc39d32a2bf7c7485e60c15bbabcd75bded1, we encountered the problem as well. Those commits date to early December.
The new fork warning is interesting. Is it possible that we're spawning new oslo_messaging processes? That would probably be in the agent, not the driver, and would likely be a bug.
@pjbreaux under what regime (test conditions) do we see the warning message about an oslo_messaging instance with a mismatched PID?
The make script that I run to excite this issue most often is systest/scripts/run_neutron_lbaas.sh. Running that in a loop should reproduce the problem within a reasonable amount of time (something like 8 hours).
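For the record, a rough Python 3 harness for that loop (the timeout value is an arbitrary guess at "much longer than a healthy run"; adjust to taste):

```python
#!/usr/bin/env python3
"""Loop the systest script until a run hangs past the timeout."""
import subprocess
import time

SCRIPT = "systest/scripts/run_neutron_lbaas.sh"
TIMEOUT = 3600  # seconds; a hung RPC call blows well past a normal run

runs = 0
start = time.time()
while True:
    runs += 1
    try:
        subprocess.run(["bash", SCRIPT], timeout=TIMEOUT)
    except subprocess.TimeoutExpired:
        print("Run %d exceeded %ds after %.0fs of looping; hang likely "
              "reproduced." % (runs, TIMEOUT, time.time() - start))
        break
```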
Adding this link for reference, as we have seen the frame_too_large problem it describes: http://john.eckersberg.com/debugging-rabbitmq-frame_too_large-error.html
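For anyone unfamiliar with the failure mode that article covers: an AMQP 0-9-1 frame is a 7-byte header (type octet, channel short, big-endian payload size) followed by the payload and a 0xCE end marker. If two processes interleave writes on one shared connection, the reader desynchronizes and parses arbitrary payload bytes as a header, so the size field can blow past the negotiated frame_max (131072 by default). A quick demonstration:

```python
import struct

def parse_frame_header(data):
    # AMQP 0-9-1: type (1 octet), channel (2 octets), size (4 octets, big-endian)
    return struct.unpack(">BHI", data[:7])

# A well-formed METHOD frame header on channel 1 with a 14-byte payload:
print(parse_frame_header(b"\x01\x00\x01\x00\x00\x00\x0e"))
# (1, 1, 14)

# The same parser fed misaligned bytes (here, a stray b"AMQP" protocol
# header) reports an absurd frame size, which the broker rejects as
# frame_too_large:
print(parse_frame_header(b"AMQP\x00\x00\x00"))
# (65, 19793, 1342177280)
```

The 'Received 0x00' / 'Received 0x04' exceptions we see (details in the description below) look like the client-side face of the same desynchronization: the byte sitting where the 0xCE frame-end marker should be is something else.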
Agent Version
We have seen it on agent releases 9.2.0 and 9.3.0.b2.
Operating System
CentOS 7
OpenStack Release
Both Liberty and Mitaka encounter this issue.
Bug Severity
1
Description
When we run tempest tests nightly, a test may occasionally hang forever. If verbosity is turned up and pytest is printing to stdout, you may see the last few calls as follows:
As far as we know, the test never recovers, nor does it time out. Even if it did time out, succeeding tests would still fail, because at that point RPC communication between the driver and the agent is effectively down. Neutron commands (such as creating a port) still work, but neutron-lbaas write commands, such as create and update, hang (read commands still work).
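To illustrate why it hangs forever rather than erroring: a driver-side RPC call() blocks until a reply lands on the caller's reply queue, and with no timeout set it blocks indefinitely once that queue is broken. A sketch against the oslo.messaging API (not the f5 driver's actual call path; the topic and method names here are invented):

```python
from oslo_config import cfg
import oslo_messaging

# Transport settings (rabbit host, etc.) come from the usual neutron config.
transport = oslo_messaging.get_transport(cfg.CONF)
target = oslo_messaging.Target(topic="f5-agent-topic")  # hypothetical topic
client = oslo_messaging.RPCClient(transport, target)

try:
    # call() publishes the request and then waits on a reply queue. With a
    # timeout it at least surfaces as an exception; without one (a plain
    # client.call(...)), it waits forever when the reply path is dead.
    client.prepare(timeout=60).call({}, "create_loadbalancer", loadbalancer={})
except oslo_messaging.MessagingTimeout:
    print("No reply within 60s; the reply-queue path is likely broken.")
```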
The only way to remedy the issue is to restart neutron-server, which often takes some time while it waits for the broken queues to time out. Here is the exception shown in the neutron server log:
There are a few flavors of exception message, mostly centered around 'Received 0x00', 'Received 0x04', and so on, which look like the frame-stream desynchronization described earlier. Sometimes this exception alone is not enough to break the RPC communication channels between the agent and the driver, but it often does.
The issue seems to happen regardless of BIG-IP version, but we have not yet determined whether it is just as common in an undercloud scenario as in an overcloud one. In nightly tests we see the issue primarily in overcloud, but that is likely because the overcloud tests run first.
Also note that the topic called out in the exception is a reply topic, i.e., the queue on which the RPC caller (the driver) waits for the agent's response.