hub module creates loops upon link failure / recovery

colin-scott commented 10 years ago

Hey,

We found what appears to be a bug in pyretic's hub module (proactive0 mode) while running some experiments.

We discovered a loop in the network while fuzz testing pyretic's pyretic.modules.hub module using STS on a 3-switch mesh topology.

After minimizing the trace generated from fuzz testing, we found that the events that triggered the loop were a link failure followed by a link recovery.

We took a brief look at pyretic's code, and we believe the root cause is related to the invocation at line 450 of pyretic/core/network.py:

self.reconcile_attributes(topology) # Root Cause?

Our understanding of this bug is that reconcile_attributes neglects to filter out down links, which ultimately causes the MST to be computed incorrectly. For example, on a 3-switch mesh, suppose pyretic initially computes the MST s1 <-> s2 <-> s3. Then link s1 <-> s2 goes down, so pyretic recomputes a new MST s1 <-> s3 <-> s2. Then the link comes back up. reconcile_attributes neglects to account for the failed/recovered link, so we end up with flow entries in the network s1 <-> s3 <-> s2 <-> s1, which forms a loop. The loop seems to persist for some time, until other events in the network cause pyretic to correct its mistake or until pyretic periodically flushes all flow entries.
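A minimal sketch of the failure mode described above, under stated assumptions: the tree is computed Kruskal-style with union-find over links in a fixed order (not necessarily how pyretic computes it), and `spanning_tree`, `has_cycle`, and the `down` parameter are hypothetical names. It shows that the recomputed tree is loop-free on its own, but flooding on the recovered s1 <-> s2 link alongside it closes a cycle.

```python
from typing import AbstractSet, Dict, List, Tuple

Link = Tuple[str, str]

def _find(parent: Dict[str, str], x: str) -> str:
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def spanning_tree(switches: List[str], links: List[Link],
                  down: AbstractSet[Link] = frozenset()) -> List[Link]:
    """Kruskal-style tree over equal-weight links, skipping links reported down."""
    parent = {s: s for s in switches}
    tree: List[Link] = []
    for a, b in links:
        if (a, b) in down or (b, a) in down:
            continue  # the filtering step the report suspects is missing
        ra, rb = _find(parent, a), _find(parent, b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree

def has_cycle(switches: List[str], edges: List[Link]) -> bool:
    """True if the undirected edge set contains a cycle."""
    parent = {s: s for s in switches}
    for a, b in edges:
        ra, rb = _find(parent, a), _find(parent, b)
        if ra == rb:
            return True
        parent[ra] = rb
    return False

SWITCHES = ["s1", "s2", "s3"]
LINKS = [("s1", "s2"), ("s2", "s3"), ("s1", "s3")]  # 3-switch mesh

before = spanning_tree(SWITCHES, LINKS)                              # s1-s2, s2-s3
after_failure = spanning_tree(SWITCHES, LINKS, down={("s1", "s2")})  # s2-s3, s1-s3
# Bug scenario: the recovered s1-s2 link is still used next to the new tree.
print(has_cycle(SWITCHES, after_failure + [("s1", "s2")]))  # True
print(has_cycle(SWITCHES, after_failure))                    # False
```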

Can you verify that this is indeed a bug?

Steps for reproducing:

First, we need to fix a minor issue. Unfortunately, STS depends on a different version of POX than pyretic, and both will be on PYTHONPATH. In pyretic.py (around line 171), hardcode poxpath rather than searching PYTHONPATH, e.g.:

poxpath = '/home/mininet/pox'
#for p in output.split(':'):
#     if re.match('.*pox/?$',p):
#         poxpath = os.path.abspath(p)
#         break
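Instead of hardcoding, one could keep the search but skip STS's copy of POX. This is a hypothetical helper (neither `find_poxpath` nor its `exclude_substring` parameter exists in pyretic) that mirrors the commented-out loop, assuming STS's POX can be recognised by a path fragment such as `/sts/` (true for the clone instructions below, where POX is cloned inside the sts directory).

```python
import os
import re
from typing import Optional

def find_poxpath(pythonpath: str, exclude_substring: Optional[str] = None) -> Optional[str]:
    """Return the first colon-separated entry ending in 'pox' (optionally with a
    trailing slash), skipping entries that contain `exclude_substring`."""
    for p in pythonpath.split(':'):
        if exclude_substring and exclude_substring in p:
            continue  # e.g. skip STS's own POX checkout
        if re.match(r'.*pox/?$', p):
            return os.path.abspath(p)
    return None  # no POX entry found
```

For example, `find_poxpath('/home/mininet/sts/pox:/home/mininet/pox', exclude_substring='/sts/')` would pick `/home/mininet/pox`.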

Then clone STS and replay the minimized trace:

$ git clone git://github.com/ucb-sts/sts.git
$ cd sts
$ git clone -b frenetic_test git://github.com/ucb-sts/pox.git
$ (git submodule init && git submodule update && cd sts/hassel/hsa-python && source setup.sh)
$ git clone git://github.com/ucb-sts/experiments.git
# Assumes pyretic is at ../pyretic/
$ ./simulator.py -c experiments/new_pyretic_loop_mcs/replay_config.py

Throughout the replay you can see both STS's and pyretic's console output (pyretic's is in orange). You can also view the console output offline at experiments/new_pyretic_loop_mcs_replay/simulator.out

At the end of the replay you can examine the flow entries to see the loop:

STS [next] >show_flows 1
--------------------------------------------------------------------------------------------------------------------------------
|  Prio | in_port | dl_type | dl_src | dl_dst | nw_proto | nw_src | nw_dst | tp_src | tp_dst |                         actions |
--------------------------------------------------------------------------------------------------------------------------------
...
| 59995 |       2 |    None |   None |   None |     None |   None |   None |   None |   None |   output(1), IN_PORT, output(3) |
--------------------------------------------------------------------------------------------------------------------------------
STS [next] >show_flows 2
--------------------------------------------------------------------------------------------------------------------------------
|  Prio | in_port | dl_type | dl_src | dl_dst | nw_proto | nw_src | nw_dst | tp_src | tp_dst |                         actions |
--------------------------------------------------------------------------------------------------------------------------------
...
| 59999 |       1 |    None |   None |   None |     None |   None |   None |   None |   None |            output(2), output(3) |
--------------------------------------------------------------------------------------------------------------------------------
STS [next] >show_flows 3
---------------------------------------------------------------------------------------------------------------------
|  Prio | in_port | dl_type | dl_src | dl_dst | nw_proto | nw_src | nw_dst | tp_src | tp_dst |              actions |
---------------------------------------------------------------------------------------------------------------------
| 59995 |    None |    None |   None |   None |     None |   None |   None |   None |   None | output(1), output(3) |
---------------------------------------------------------------------------------------------------------------------

You can also examine the network topology:

STS [next] > topology.network_links
[(1:1) -> (3:2), (1:2) -> (2:2), (3:2) -> (1:1), (2:2) -> (1:2), (2:1) -> (3:1), (3:1) -> (2:1)]

And verify that the invariant violation is there:

STS [next] > inv loops
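The loop can also be traced by hand from the rules and links printed above. The sketch below does that mechanically; it is a rough approximation of the kind of check `inv loops` reports, not STS's implementation. Assumptions: only the printed flow rows are modelled (the elided `...` rows are ignored), rules per switch are listed highest priority first, ports with no link face hosts, and the packet enters s3 from a host on port 3.

```python
from typing import Dict, List, Optional, Set, Tuple, Union

Port = Tuple[int, int]   # (switch dpid, port number)
IN_PORT = "IN_PORT"      # the IN_PORT output action

# Directed links exactly as printed by topology.network_links.
LINKS: Dict[Port, Port] = {
    (1, 1): (3, 2), (1, 2): (2, 2), (3, 2): (1, 1),
    (2, 2): (1, 2), (2, 1): (3, 1), (3, 1): (2, 1),
}

# Per switch: (in_port match, or None for a wildcard, output actions).
FLOWS: Dict[int, List[Tuple[Optional[int], List[Union[int, str]]]]] = {
    1: [(2, [1, IN_PORT, 3])],
    2: [(1, [2, 3])],
    3: [(None, [1, 3])],
}

def next_hops(node: Port) -> List[Port]:
    """Where a packet arriving at (switch, in_port) is forwarded next."""
    sw, in_port = node
    for match, actions in FLOWS.get(sw, []):
        if match is None or match == in_port:
            hops = []
            for act in actions:
                out = in_port if act == IN_PORT else act
                if (sw, out) in LINKS:  # unlinked ports are assumed to lead to hosts
                    hops.append(LINKS[(sw, out)])
            return hops  # first (highest-priority) match wins
    return []  # no matching rule: treat as dropped

def find_loop(start: Port) -> Optional[List[Port]]:
    """Depth-first search; returns a cycle of (switch, in_port) nodes if reachable."""
    path: List[Port] = []
    on_path: Set[Port] = set()
    done: Set[Port] = set()

    def dfs(node: Port) -> Optional[List[Port]]:
        if node in on_path:
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for nxt in next_hops(node):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return dfs(start)

print(find_loop((3, 3)))  # [(2, 1), (1, 2), (3, 2), (2, 1)]
```

The cycle found is s3 -> s2 (arriving on port 1) -> s1 (port 2) -> s3 (port 2) -> s2 again, which matches the s1 <-> s3 <-> s2 <-> s1 loop described in the report.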

Thanks!

joshreich commented 10 years ago

Hi Colin,

Thanks so much for reporting this! It is rare we get a bug report that is so clearly and completely documented (but I guess that's the whole point of STS ;-). I'm traveling today and tomorrow but will try to look into it by Wed. evening, though it's possible another member of the team will have time before I do.

Cheers, -Josh

colin-scott commented 10 years ago

I also just noticed this on one of delta debugging's iterations for the original trace:

[c1]   File "/home/mininet/pyretic/of_client/pox_client.py", line 512, in _handle_ConnectionUp
[c1]     assert event.dpid not in self.switches
[c1] AssertionError

colin-scott commented 10 years ago

Any chance you could verify that this is a bug before Friday? I'm guessing you're probably working on your own submission so I understand if you don't have any time, but hopefully this shouldn't take too long.

Thanks! -Colin

joshreich commented 10 years ago

Hi Colin,

I thought @nkatta's response had been recorded here, but I see that it hasn't. I'm inlining that response below, which verifies that you have detected at least one bug in our topology detection code. I've also assigned this issue to @nkatta and @SiGe, who will hopefully look into the assertion error above (you are right, I'm pushing for an end-of-the-week deadline myself), though I'm pretty sure it's a genuine bug as well.

Hi guys,

Omid and I were looking at the origins of the bug, and the following seems to be the case:

Pyretic is basically installing rules that forward packets on non-existent links: even before the network topology is discovered, we install rules that flood packets on all the ports of a switch. This ends up creating a loop in the installed policy. Getting rid of such loops in the bootstrap phase seems particularly tricky until you have discovered the entire topology. So one possible solution is to wait (with the help of a timer, as a best-effort measure) until we discover the topology (by sending LLDP packets).
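The timer-based idea could look roughly like the following best-effort gate. This is a hypothetical sketch, not pyretic code: `SettleTimer`, `note_change`, and `settled` are illustrative names, and the clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from typing import Callable

class SettleTimer:
    """Report the topology as 'settled' once no topology change (for example an
    LLDP-discovered link or a new switch) has been seen for `quiet_period`
    seconds. Best effort only: a slow link can still show up later."""

    def __init__(self, quiet_period: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quiet_period = quiet_period
        self.clock = clock
        self.last_change = clock()  # start timing when the controller boots

    def note_change(self) -> None:
        """Call on every link up/down or switch join; restarts the quiet period."""
        self.last_change = self.clock()

    def settled(self) -> bool:
        """True once the quiet period has elapsed since the last change."""
        return self.clock() - self.last_change >= self.quiet_period
```

A controller would then compile and install its forwarding policy only while `settled()` is true, which carries exactly the risk Naga points out: nothing guarantees the topology was complete when the timer fired.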

-Naga

colin-scott commented 10 years ago

Awesome, thanks guys!

joshreich commented 10 years ago

very welcome and good luck! (it looks like you are doing really neat stuff :-)

nkatta commented 10 years ago

Hi Colin,

My take on this is that there is a bug (in the Header Space Analysis sense): Pyretic installs a policy that can make packets loop when topology changes occur, so we should make sure we do not install such rules. A “strict” solution would be to treat ports that are not linked to another port (by topology discovery) as “down” ports and not forward packets on them. But this would mean we cannot forward packets to end hosts (because they don’t respond to LLDP packets), which we do not want! An alternative, “non-strict” solution is to treat the topology bootstrap process differently: set a timer, wait until the entire topology is discovered, and then produce a policy that can be installed on the network. However, this might not solve the problem entirely, because you may not discover the whole topology before the timer expires.

The question is: is it possible to do away with this bug entirely? The answer seems to be “No” (?). There is always a timer that is set before you decide you have discovered the entire topology using LLDP packets and then install your policy on the switches. Topology discovery may not complete before the timer expires, so a packet might end up in a loop (if we do it the “non-strict” way), or end hosts might be isolated entirely (if we do it the “strict” way). My guess is that every other controller out there has a similar issue, because all of them must depend on timers of some sort for topology discovery.
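The trade-off between the two approaches can be seen on a single switch. This is a toy sketch; `flood_ports` is a hypothetical name, and the port numbers are illustrative (port 3 is assumed to face a host that never answers LLDP).

```python
from typing import Set

def flood_ports(all_ports: Set[int], lldp_linked: Set[int],
                in_port: int, strict: bool) -> Set[int]:
    """Ports a hub-style flood may use on one switch.

    strict:     only ports whose peer switch was confirmed by LLDP, so hosts
                on any other port are cut off.
    non-strict: every port except the ingress, so an inter-switch link that
                has not been discovered yet can close a loop.
    """
    candidates = set(all_ports) - {in_port}
    if strict:
        candidates &= set(lldp_linked)
    return candidates
```

With ports {1, 2, 3}, LLDP confirmed on {1, 2}, and a packet arriving on port 1, the strict variant floods only to {2} (the host on port 3 is unreachable), while the non-strict variant floods to {2, 3}.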

Thanks Naga

colin-scott commented 10 years ago

Makes sense. But I'm not sure that the distinction between "learning the topology" and "topology has been learned" is meaningful, simply because the topology can change at any time (i.e. links can go down or come up at any point), and there is a delay between those changes and when the controller learns about them. In other words, the controller is always bootstrapping its knowledge -- it never really stops.

If we place a priori restrictions on which ports hosts can be attached to, would you be able to avoid the loops?

SiGe commented 10 years ago

Colin, I have a question regarding the show_flows command.

We always install higher priority rules first (Wireshark confirms this as well), but in the show_flows output some of the lower priority rules are installed while the higher priority ones are missing. I am not quite sure how this could happen. Do you have any insights that might help?

nkatta commented 10 years ago

@colin-scott I sort of abused the terminology there -- it is indeed not meaningful to say that I have learned the topology at some stage (as the topology does evolve all the time). What I meant was that there should be some way to recognize where the hosts are connected as you were suggesting. That should help avoid loops.

colin-scott commented 10 years ago

If you run sts in verbose mode (pass -v on the command line) it should print the flow table of each switch whenever a flow_mod arrives. You could trace through the console output to see when the higher priority flow_mods are installed or uninstalled.

princedpw commented 10 years ago

We might sit down and try to think of a principled solution to this problem in an abstract setting. Try to give precise definitions of "host port" and "internal port" and "inactive port" based on packets seen arriving at such ports (or by external fiat). Then enforce constraints on policy or rule installation based on those definitions.
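One way such definitions might be made concrete, combined with the suggestion of declaring host ports a priori. This is a hypothetical sketch: the role names, the evidence used (port status, LLDP seen from a peer switch, operator declaration), and the flooding constraint are all assumptions, not pyretic code.

```python
from enum import Enum

class PortRole(Enum):
    INTERNAL = "internal"  # LLDP from a peer switch has arrived on this port
    HOST = "host"          # declared by the operator ("external fiat")
    INACTIVE = "inactive"  # port is down, or there is no evidence yet

def classify_port(is_up: bool, lldp_seen: bool, declared_host: bool) -> PortRole:
    if not is_up:
        return PortRole.INACTIVE
    if declared_host:
        return PortRole.HOST
    if lldp_seen:
        return PortRole.INTERNAL
    return PortRole.INACTIVE  # unknown ports are treated conservatively

def may_flood(role: PortRole, on_spanning_tree: bool) -> bool:
    """Constraint on rule installation: host ports always receive flooded
    traffic; internal ports only if they lie on the current spanning tree."""
    return role is PortRole.HOST or (role is PortRole.INTERNAL and on_spanning_tree)
```

Treating unknown ports as inactive is the “strict” choice; the a priori host declaration is what keeps end hosts reachable despite it.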

Dave

AghOmid commented 10 years ago

Hello Colin,

It seems that the order in which the OpenFlow rules get installed is not the same as the order in which we sent them to the switch. Is there anything on the STS side that might cause this?

Adding a timer between rule installations seems to fix the loop invariant violation. Does STS reorder the FlowMods in some way? [Wireshark shows the order in which we sent them, while STS shows a different order when the rules are installed.]

Wireshark: the first flow mod that was sent: 60000
STS: the first flow mod that was installed: 59996

princedpw commented 10 years ago

we need consistent updates.

jnfoster commented 10 years ago

"We might sit down and try to think of a principled solution to this problem"

"we need consistent updates."

Indeed. Party like it's January 2012.

-N

SiGe commented 10 years ago

@princedpw @jnfoster Haha, I guess this brings back memories.

@princedpw We need consistent updates, but I feel like this problem goes a bit beyond that. It seems that either STS or the software switch is not preserving the order in which flow mods are installed. Doesn't TCP already guarantee that packets are received in order?

reitblatt commented 10 years ago

Messages are guaranteed to be delivered in order via TCP, but OF does not require that messages be processed in order. This allows switches to reschedule messages optimally depending on their HW limitations. For example, a switch may not be able to process two FLOW_MODs that match on IP simultaneously, but it can install an Ethernet FLOW_MOD and an IP FLOW_MOD simultaneously (because they hit different HW tables). If message ordering matters, you have to use a BARRIER.
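The ordering semantics described here can be illustrated with a toy switch that may reorder FLOW_MODs between barriers but never across them. This is a hypothetical model for illustration only (the message labels and the shuffle-based reordering are assumptions), not STS's or any real switch's scheduler.

```python
import random
from typing import List

BARRIER = "BARRIER"

def apply_in_switch(msgs: List[str], rng: random.Random) -> List[str]:
    """FLOW_MODs that arrive between two BARRIERs may be applied in any order,
    but everything before a BARRIER is applied before anything after it."""
    applied: List[str] = []
    batch: List[str] = []
    for msg in msgs:
        if msg == BARRIER:
            rng.shuffle(batch)  # reordering is allowed within the batch
            applied.extend(batch)
            batch = []
        else:
            batch.append(msg)
    rng.shuffle(batch)  # trailing messages with no barrier after them
    applied.extend(batch)
    return applied

def with_barriers(flow_mods: List[str]) -> List[str]:
    """Interleave a BARRIER after every FLOW_MOD."""
    out: List[str] = []
    for fm in flow_mods:
        out.extend([fm, BARRIER])
    return out

MODS = ["prio=60000", "prio=59999", "prio=59996"]
```

With `with_barriers(MODS)`, every run applies the rules in the order they were sent; with `MODS` alone, the same three messages may be applied in any order, which is how a lower-priority rule can become active before a higher-priority one.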

SiGe commented 10 years ago

@reitblatt, I tried to send barrier messages in between install_rules. Rules still get installed in the wrong order. The only explanation that I could come up with is that something is changing the order that OF messages are getting sent.

reitblatt commented 10 years ago

Must be a bug. If you send a BARRIER between each FLOW_MOD, then there should be no reordering. I noticed in experiments a while ago that wireshark missed some OF messages (in that case BARRIER_REQ). I never tracked down that bug.

colin-scott commented 10 years ago

@jnfoster Isn't it the case that consistent updates are intended only for planned changes? (whereas link failures are unplanned?)

@SiGe By default, STS's switches should always install flow_mods in order, immediately upon receiving them. There is an option to have them reorder flow_mods (between barrier_in's), but I can't think of why that would be enabled. Are you sure that, in verbose mode, the flow_mods shown on the console show up in the wrong order?

colin-scott commented 10 years ago

You can also view the order the flow_mods are installed from the original trace with:

./tools/pretty_print_event_trace.py experiments/new_pyretic_loop_mcs/mcs.trace

SiGe commented 10 years ago

Colin,

Going back home now. I'll send you log files of sts/wireshark sometime tonight.

I'll also check the trace file.

jnfoster commented 10 years ago

Quoting @colin-scott: "Isn't it the case that consistent updates are intended only for planned changes? (whereas link failures are unplanned)"

Not at all! Semantically, a consistent update is just a sequence of instructions that possesses certain properties. There are many possible implementations. Some implementations, like two-phase update, probably do work best with planned change. But you could also implement a consistent update using other features like fast-failover groups that work well with unexpected situations like link failures.

The point is, with any consistent update, you would be sure that no loops are introduced as long as the old and new policies are loop-free.
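This claim can be checked on the 3-switch example from this issue with a toy model of a version-stamped (two-phase) update. The sketch assumes hub-style flooding along a spanning tree, that each packet matches only rules carrying the version stamped at ingress, and that old-version packets have drained before clean-up; `flood_loops`, `STAGES`, and `update_is_loop_free` are hypothetical names.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Edge = FrozenSet[str]

def edges(*pairs: Tuple[str, str]) -> Set[Edge]:
    return {frozenset(p) for p in pairs}

def flood_loops(tree: Set[Edge], src: str) -> bool:
    """Hub-style flood over `tree`: each switch forwards to every neighbour
    except the one it heard the packet from. True if some switch is reached
    twice, i.e. copies of the packet would circulate."""
    adj: Dict[str, Set[str]] = {}
    for e in tree:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen = {src}
    frontier: List[Tuple[str, Optional[str]]] = [(src, None)]
    while frontier:
        node, prev = frontier.pop()
        for nbr in adj.get(node, set()) - {prev}:
            if nbr in seen:
                return True
            seen.add(nbr)
            frontier.append((nbr, node))
    return False

OLD = edges(("s1", "s2"), ("s2", "s3"))  # tree before the failure
NEW = edges(("s1", "s3"), ("s3", "s2"))  # tree after the failure

# Each stage: (rules installed per version tag, version stamped at ingress).
STAGES = [
    ({1: OLD}, 1),          # steady state
    ({1: OLD, 2: NEW}, 1),  # phase 1: add version-2 rules; no packet uses them yet
    ({1: OLD, 2: NEW}, 2),  # phase 2: ingress switches stamp version 2
    ({2: NEW}, 2),          # clean-up, once version-1 packets have drained
]

def update_is_loop_free(stages, switches=("s1", "s2", "s3")) -> bool:
    # A packet only matches rules carrying its own tag, so checking each
    # installed version separately covers every packet in flight.
    return (all(tag in rules for rules, tag in stages)
            and all(not flood_loops(rules[v], s)
                    for rules, _ in stages for v in rules for s in switches))
```

Here `update_is_loop_free(STAGES)` holds, while `flood_loops(OLD | NEW, "s1")` is true: mixing the two trees without version tags reproduces the loop from this issue.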

-N

colin-scott commented 10 years ago

Gotcha, makes sense

mcanini commented 10 years ago

What switch are you running on? Certain OF agents are known to not implement barriers correctly.

-Marco

mcanini commented 10 years ago

That is reminiscent of how STP works in a LAN. Each port goes through a learning phase before it is considered safe to forward traffic on it while guaranteeing loop freedom.

-Marco

colin-scott commented 10 years ago

We built our own. The latest version has been verified as OpenFlow 1.0-compliant by oftest.

STS buffers messages between the controllers and the switches in order to maintain causality during replay. By default, during replay STS should only allow messages that are functionally equivalent to messages in the original run. I'm guessing that what @SiGe is seeing is that some packets are being sent, but not immediately allowed through to the switches since they weren't in the original run. If you want to allow new messages through, you can do so by adding "allow_unexpected_messages=True" as a parameter to Replayer in experiments/new_pyretic_loop_mcs/replay_config.py

SiGe commented 10 years ago

@reitblatt Thanks Mark, that makes sense. Maybe what @colin-scott described explains why the barrier messages didn't "work".

@mcanini I am on 1.10.2. I believe this is the same version that @jnfoster was running.

colin-scott commented 10 years ago

@SiGe Incidentally, if you want to replay the trace exactly as it occurred originally, invoking

./simulator.py -c experiments/new_pyretic_loop_mcs/interactive_replay_config.py

will rerun the execution "headless", without pyretic, to show the state transitions the network went through.

SiGe commented 10 years ago

@colin-scott,

Here's a summary of what I have so far.

With allow_unexpected_messages and @reitblatt's suggestion of adding a barrier after every flow_mod, the loop invariant violation that I was getting a few hours back is gone.

However, I have started to see non-deterministic behavior. On subsequent runs, I sometimes get a loop invariant violation (not the same as the first one, although with the same signature), and sometimes I don't. As @princedpw and @jnfoster suggested, this will probably get fixed with consistent updates.

I also think that @nkatta and @mcanini are correct in that we need to come up with a solution for identifying port status. Maybe until we "discover" the topology we should not install new rules.

I'll try to find an explanation on why the violation is still happening tomorrow.

P.S. You can check my simulation output at https://github.com/SiGe/sts_output; it is the run with no loop invariant violation, but as I said, in that run I got lucky.

frenetic-lang / pyretic

hub module creates loops upon link failure / recovery #26