brunorijsman / rift-python

Routing In Fat Trees (RIFT) implementation in Python
Apache License 2.0

Interpretation of chained events is incorrect ... #5

Closed przygienda closed 6 years ago

przygienda commented 6 years ago

Bruno, chained events that you introduced are NOT special. The PUSH simply adds events at the end of the queue, and they are processed after the events in front of them have been processed. There is nothing special about the "events that a transition generates".

The interpretation you use currently can lead to interop problems, although the FSM is built towards very resilient self-stabilization, so stable oscillations should not happen unless we get some very, very pathological timer sequences (although at a 1-second timer the FSM would have to be incredibly slow in processing) ...

brunorijsman commented 6 years ago

Can you give me a concrete example of interop problems caused by my interpretation?

I observed real problems by not following my interpretation, including the specific example in the document.

— Bruno


przygienda commented 6 years ago

Of course the implementation should NOT be tardy when getting events and process them "after a bunch accumulated". You got a packet, you push an event, you kick the FSM to process it. If you allow timers/packets/other stuff to pile up in the queue, then yes, funny things start to happen when the transitions start to put events @ end of queue, because you're pretty much running a "time dilution" ... Let me look @ your example in detail ...

przygienda commented 6 years ago

Yes, your problem is caused by "piling up events". You put LIE received on the event queue and you immediately push the FSM; you don't see whether "timer expired and I put that on the queue as well before driving the FSM". If you start to "pile up events" you'll never fix that by any tricks IME, the "preferred events" being just one patch that works in one case and causes more problems in another. I remember having read a solid research paper ages ago where this was proven in rigorous terms, but in short "NEVER reorder events on the event queue" is the conclusion there ...

brunorijsman commented 6 years ago

Sorry, you are being too vague.

I need a concrete example showing that what I am doing will break interop or correctness.

At this point, I am still convinced that what I am doing is correct.

If you want, I can give you a concrete example of a state machine where not doing what I am doing will lead to provably incorrect behavior.

— Bruno


brunorijsman commented 6 years ago

PS:

"and you immediately push the FSM”

I can do that in my implementation because it is completely single-threaded (I may actually consider doing that instead of my current solution), but it is impossible to do in a multi-threaded environment where the event is generated in the context of one thread and the FSM transition takes place in the context of another thread (I have plenty of experience with that from my earlier Erlang applications).

— Bruno

przygienda commented 6 years ago

OK, show me an example where you process every single event that occurred externally immediately on the FSM, i.e. you never queue up two events before you push the FSM, and that breaks the FSM.

What you try to do is to achieve the equivalent behavior after you have queued multiple events before kicking off the FSM.

imagine you're in state1

event1: state1 -> state2
event2: state2 -> state3
event3: state2 -> state1

state1 on processing event1 pushes event3. If you start the FSM processing in state1 with the following events occurring externally

event1, event2

you end up in state3, which cannot process event3, and that's what your "reordering of the event queue" tries to fix by "processing the pushed events first". I don't have the proof of equivalence of that here, it may work but that's not normally how things are done ...
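For concreteness, the single-FIFO reading of this example can be sketched in a few lines of Python (a toy model for this discussion, not the actual rift-python code):

```python
# Toy model: one FIFO event queue; events pushed by a transition go to the END,
# behind any already-queued external events. State/event names are from the
# example above.
TRANSITIONS = {
    # (state, event): (to-state, events pushed by the transition)
    ("State1", "Event1"): ("State2", ["Event3"]),
    ("State2", "Event2"): ("State3", []),
    ("State2", "Event3"): ("State1", []),
}

def run_single_queue(state, events):
    queue = list(events)
    trace = [state]
    while queue:
        event = queue.pop(0)
        if (state, event) not in TRANSITIONS:
            trace.append("missing transition for " + event)
            break
        state, pushed = TRANSITIONS[(state, event)]
        queue.extend(pushed)  # pushed events land behind queued external events
        trace.append(state)
    return trace

# Event1 and Event2 are both queued before the FSM runs:
print(run_single_queue("State1", ["Event1", "Event2"]))
# → ['State1', 'State2', 'State3', 'missing transition for Event3']
```

The trace reproduces the claim: Event3, pushed while processing Event1, is only dequeued after Event2 has already moved the FSM to State3.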

brunorijsman commented 6 years ago

MY RESPONSE TO YOUR EXAMPLE

OK, show me an example where you process every single event that occurred externally immediately on the FSM, i.e. you never queue up two events before you push the FSM, and that breaks the FSM. What you try to do is to achieve the equivalent behavior after you have queued multiple events before kicking off the FSM.

imagine you're in state1

event1: state1 -> state2
event2: state2 -> state3
event3: state2 -> state1

state1 on processing event1 pushes event3. If you start the FSM processing in state1 with the following events occurring externally: event1, event2, you end up in state3 which cannot process event3, and that's what your "reordering of the event queue" tries to fix by "processing the pushed events first". I don't have the proof of equivalence of that here, it may work but that's not normally how things are done ...

This is what would happen according to your rules:

1) Starting state: State1.
2) Event1 happens. By your rule, I must now completely process Event1 immediately ("you never queue up two events before you push the FSM"). I must not consider or enqueue any other events until Event1 is fully processed by the FSM.
3) Processing of Event1: Event1 in State1 => push Event3, and go to State2.
4) By your rule, I must now completely process Event3 immediately ("you never queue up two events before you push the FSM"). I must not consider or enqueue any other events until Event3 is fully processed by the FSM.
5) Event3 in State2 => go to State1. We are now in State1.
6) Event1 has now been completely processed and we may consider other events. Event2 is the next event that occurs.
7) Processing of Event2: Event2 in State1 => NO TRANSITION FOUND (maybe you intended "Event2: State1 -> State3"?)

Summary: State1 -> Event1 -> State2 -> Event3 -> State1 -> Event2 -> missing transition

Note: I don't see how "you end up in state3 which cannot process event3" (but see alternative interpretation below).

This is what would happen according to my rules.

There are two scenarios, depending on the exact timing of events and threads.

Both lead to the same observable behavior:

Here is scenario 1:

1) Starting state: State1.
2) External Event1 happens; it is enqueued on the normal (external) event queue.
3) Before the FSM gets a chance to look at the event queue, external Event2 happens; it is enqueued on the normal (external) event queue behind Event1.
4) The FSM starts processing the enqueued events. There are no events on the chained event queue, so we pop Event1 from the normal event queue.
5) Processing of Event1: Event1 in State1 => push Event3 on the chained event queue, and go to State2. We are now in State2.
6) There now is an event on the chained event queue, so process that first: pop Event3 from the chained event queue.
7) Processing of Event3: Event3 in State2 => go to State1. We are now in State1.
8) There are no events on the chained event queue anymore, so we pop Event2 from the normal event queue.
9) Processing of Event2: Event2 in State1 => NO TRANSITION FOUND (maybe you intended "Event2: State1 -> State3"?)

Summary: State1 -> Event1 -> State2 -> Event3 -> State1 -> Event2 -> missing transition (exactly the same as yours)

Here is scenario 2:

1) Starting state: State1.
2) External Event1 happens; it is enqueued on the normal (external) event queue.
3) The FSM starts processing the enqueued events before Event2 occurs. There are no events on the chained event queue, so we pop Event1 from the normal event queue.
4) Processing of Event1: Event1 in State1 => push Event3 on the chained event queue, and go to State2. We are now in State2.
5) There now is an event on the chained event queue, so process that first: pop Event3 from the chained event queue.
6) Processing of Event3: Event3 in State2 => go to State1. We are now in State1.
7) There are no events on the chained event queue anymore, and there are also no events on the normal event queue [note1], so block until another event appears on any of the queues.
8) External Event2 happens; it is enqueued on the normal (external) event queue. The FSM starts processing the enqueued events again.
9) Processing of Event2: Event2 in State1 => NO TRANSITION FOUND (maybe you intended "Event2: State1 -> State3"?)

Summary: State1 -> Event1 -> State2 -> Event3 -> State1 -> Event2 -> missing transition (exactly the same as yours)

[note1] There are other scenarios where Event2 is pushed on the normal event queue (in another thread) while Event1 or Event3 are still being processed. All these scenarios lead to the same externally observable behavior.
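The two-queue scheme described above can be sketched in Python (a toy model of the idea, not the actual rift-python code; names are from the example):

```python
# Toy model of the two-queue scheme: events pushed by a transition go on a
# separate "chained" queue, which is always drained before the next external
# event is taken from the normal queue.
TRANSITIONS = {
    # (state, event): (to-state, events pushed by the transition)
    ("State1", "Event1"): ("State2", ["Event3"]),
    ("State2", "Event2"): ("State3", []),
    ("State2", "Event3"): ("State1", []),
}

def run_two_queues(state, external_events):
    external = list(external_events)
    chained = []
    trace = [state]
    while chained or external:
        # Chained events always win over external events.
        event = chained.pop(0) if chained else external.pop(0)
        if (state, event) not in TRANSITIONS:
            trace.append("missing transition for " + event)
            break
        state, pushed = TRANSITIONS[(state, event)]
        chained.extend(pushed)
        trace.append(state)
    return trace

print(run_two_queues("State1", ["Event1", "Event2"]))
# → ['State1', 'State2', 'State1', 'missing transition for Event2']
```

Note that the trace matches both scenarios above (and Tony's "process immediately" rule): Event3 runs right after Event1, before Event2 is considered.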

From this I conclude that the externally observable behavior (the exact sequence of actions executed, and the exact sequence of states) is exactly the same in my implementation as in your implementation.

I still don't understand why my approach is incorrect.

THE PROBLEM WITH NOT QUEUEING IN A MULTI-THREADED ENVIRONMENT

First, an example of why "processing every single event that occurred externally on the FSM" without any queueing of events is very problematic and usually "not done" (see [Note] below) in multi-threaded environments.

Consider a thread (T-CPM) which reads control-plane messages from a TCP connection. Each Control-Plane Message (CPM) relates to a particular Object (O). There are many such objects, and each message may relate to a different object.

Each object is modelled as an FSM instance which runs in a separate thread dedicated to that object. We assume object O1 is running its FSM in the context of thread T-O1. Object O2 runs in T-O2, etc.

When we receive Control Plane Message 1 (CPM1), it has to be dispatched as a Control Plane Message Received (CPM-Received) event for Object 1 (O1). The next message, CPM2, has to be dispatched as a CPM-Received(CPM2) event to object O2. Etc.

Consider another thread (T-MPM) which reads Management-Plane Messages (MPM) from another TCP connection. MPM1 would be dispatched as a Management Plane Message Received (MPM-Received(MPM1)) event to object O1.

There may be other threads that generate events for the same object O1 (e.g. thread T-Timer that generates timer events, thread T-CLI that processes CLI commands, etc.).

Now consider that control-plane message CPM1 is received in thread T-CPM and concurrently (on a different CPU core) management-plane message MPM1 is received in thread T-MPM. Maybe there is also a timer expiring in thread T-Timer, etc.

CPM1 causes a CPM-Received(CPM1) event to be dispatched to object O1.

MPM1 causes an MPM-Received(MPM1) event to be dispatched to the same object O1.

We need some method to handle the fact that two external events are being dispatched to the same object O1 concurrently.

We now have two options

1) Option 1. As soon as CPM1 is received, we make sure that the FSM in T-O1 completely processes event CPM-Received(CPM1), and we make sure that there is no possibility of any other event being queued to the same thread T-O1 before that processing is complete. This means that any other thread that would like to post an event for object O1 must be blocked until T-O1 has finished processing event CPM-Received(CPM1). In practice, this means that object O1 must take some sort of lock or semaphore, which will block any other thread from posting events to O1. But wait, it gets worse: this means that T-MPM and T-Timer may be blocked on object O1 and prevented from reading messages for other objects O2, O3, etc. This in turn means that your application essentially becomes single-threaded on a per-object basis. Goodbye concurrency.

2) Option 2. Instead of blocking, each object (i.e. each FSM instance) has its own queue of events. When thread T-CPM reads a CPM message for object O1, it posts the corresponding event on the event queue of Object O1. That event will be processed asynchronously in thread T-O1, and T-CPM can continue in parallel to read messages for other objects O2, O3, O4, ...
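Option 2 can be sketched in Python with the standard library (a toy model of the per-object queue idea; class and event names are illustrative, not rift-python code):

```python
import queue
import threading

# Toy model of option 2: each object (FSM instance) owns an event queue and a
# worker thread; producer threads (T-CPM, T-MPM, T-Timer, ...) post events and
# never block on the FSM.
class FsmObject:
    def __init__(self):
        self.events = queue.Queue()      # per-object event queue
        self.processed = []
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def post(self, event):
        # Called from any producer thread; returns immediately.
        self.events.put(event)

    def _run(self):
        while True:
            event = self.events.get()    # block until an event arrives
            if event is None:            # sentinel: shut the worker down
                break
            self.processed.append(event) # a real FSM would run the transition here

    def stop(self):
        self.events.put(None)
        self.worker.join()

o1 = FsmObject()
o1.post("CPM-Received(CPM1)")            # e.g. from thread T-CPM
o1.post("MPM-Received(MPM1)")            # e.g. from thread T-MPM
o1.stop()
print(o1.processed)
# → ['CPM-Received(CPM1)', 'MPM-Received(MPM1)']
```

The thread-safe queue serializes concurrent producers per object without any object-wide lock being held across FSM transitions.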

In the next section, I will make the argument that option 2 is what is used in real life by massively parallel platforms such as the Open Telecom Platform (OTP): http://erlang.org/doc/

[Note] AN EXAMPLE OF QUEUING BEING USED IN A REAL-LIFE MULTI-THREADED FSM IMPLEMENTATION

As an example of a massively "proven in real life" multi-threaded FSM implementation, have a look at the implementation of FSMs in Erlang. The module is called gen_fsm and the documentation is at http://www1.erlang.org/doc/design_principles/fsm.html. Notice that generating an event (for example the button function in section 3.4) is implemented by calling gen_fsm:send_event, which works as follows (I quote from section 3.4): "the event is made into a message and sent to gen_fsm". Like all messages in Erlang, the message is queued on a message queue. To continue the quote: "when the event is received (which in Erlang means dequeued from the message queue), the gen_fsm calls StateName(Event, StateData)". In other words, events are queued and there may be multiple events on the queue.

ON THE ABSENCE OF CHAINED EVENT PUSHING IN OTHER FRAMEWORKS

I will now make the argument that other frameworks (such as gen_fsm in OTP) do not need the concept of "pushing (chained) events from inside an action function", which is the root issue of this whole discussion.

A concrete example of using the gen_fsm framework is here: https://github.com/brunorijsman/erlang-bgp/blob/master/src/bgp_connection_fsm.erl. This is an Erlang FSM which I wrote to implement the formal FSM that is specified in the most recent BGP RFC (see RFC 4271, section 8). Note that the Erlang gen_fsm framework does not make use of pushing events from inside the transition.

This concrete example demonstrates that all actions are always executed atomically as part of a single event, without the possibility of other events being interleaved.

Conditionals that influence the to-state can be implemented right in the action itself, with if statements that return one to-state or another. There is no need for event pushing to implement conditional to-state transitions. See for example function state_connect on line 684 of bgp_connection_fsm.erl, where the actions and the to-state depend on the value of check_delay_open_attribute.

My implementation of always executing "chained" events first is intended to emulate this Erlang behavior, i.e. all actions related to a single external event are always processed atomically, without the possibility of interleaving other external events.

I would have preferred to achieve this by supporting an if-then-else construction which allows the to-state to depend on the action, in a similar manner as Erlang supports. I made that very comment in the first set of comments, but you did not agree. Hence, I resorted to emulating this behavior with the concept of chained events vs. pushed events.
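The gen_fsm style of letting the action decide the to-state can be sketched in Python (a hypothetical sketch, not the actual Erlang code; names like delay_open and tcp_established are illustrative, loosely modeled on the BGP DelayOpen logic):

```python
# Hypothetical sketch: per-state handlers in the style of gen_fsm's
# StateName(Event, StateData) callbacks. The action itself returns the
# to-state, so a conditional to-state needs no pushed event.
class ConnectionFsm:
    def __init__(self, delay_open):
        self.delay_open = delay_open
        self.state = "connect"

    def handle(self, event):
        # Dispatch to the handler named after the current state and event.
        handler = getattr(self, self.state + "_" + event)
        self.state = handler()           # the action decides the to-state

    def connect_tcp_established(self):
        if self.delay_open:              # if-then-else inside the action
            return "delay_open_wait"
        return "open_sent"

fsm = ConnectionFsm(delay_open=True)
fsm.handle("tcp_established")
print(fsm.state)
# → delay_open_wait
```

The whole decision runs atomically inside one event; no other event can be interleaved between the action and the state change.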

ALTERNATIVE INTERPRETATION OF YOUR EXAMPLE

You said "you end up in state3 which cannot process event3"

This made me think that maybe you use an event queue after all, but only a single one:

1) Starting state: State1.
2) Event1 happens. Enqueue Event1 (never mind your rule "you never queue up two events before you push the FSM").
3) Event2 happens. Enqueue Event2 (never mind the rule again).
4) Now start processing the FSM: pop Event1 from the event queue.
5) Processing of Event1: Event1 in State1 => push Event3, and go to State2. Enqueue Event3 at the end of the queue (never mind the rule again). We are now in State2.
6) Pop Event2 from the event queue. Processing of Event2: Event2 in State2 => State3. We are now in State3.
7) Pop Event3 from the event queue. Processing of Event3: NO TRANSITION FOUND (maybe you intended "Event3: State3 -> State1"?)

Summary: State1 -> Event1 -> State2 -> Event2 -> State3 -> Event3 -> missing transition

This is the implementation that I had originally.

Here is a graphic example of why this is problematic.

StateRiflePointsForward + EventCheckMuzzle => ActionCheckBarrel, goto StateLookingInBarrel
StateLookingInBarrel + EventBarrelLooksClean => ActionPointRifleForward, goto StateRiflePointsForward
StateLookingInBarrel + EventBarrelLooksDirty => ActionCleanBarrel, ActionPointRifleForward, goto StateRiflePointsForward
StateRiflePointsForward + EventSurprisedByJumpingDeer => push EventPullTrigger, stay in StateRiflePointsForward
StateLookingInBarrel + EventSurprisedByJumpingDeer => (don't pull trigger), stay in StateLookingInBarrel
StateRiflePointsForward + EventPullTrigger => ActionFireIntoTheWoods, stay in StateRiflePointsForward
StateLookingInBarrel + EventPullTrigger => ActionBlowYourHeadOff, goto StateYouAreDead

ActionCheckBarrel: if the barrel is clean, push EventBarrelLooksClean, else push EventBarrelLooksDirty
ActionPointRifleForward: point rifle forward, no pushed events
ActionCleanBarrel: clean barrel of rifle, no pushed events
ActionFireIntoTheWoods: safely fire into the woods, no harm done, no pushed events
ActionBlowYourHeadOff: ohoh, this is bad news, no pushed events

Looking at this state machine (particularly the two EventSurprisedByJumpingDeer transitions) it feels like we are following safe procedures. But you may end up dead in the following sequence:

1) Starting state: StateRiflePointsForward.
2) Event EventSurprisedByJumpingDeer happens. Enqueue EventSurprisedByJumpingDeer.
3) Event EventCheckMuzzle happens. Enqueue EventCheckMuzzle.
4) Start processing the FSM: pop event EventSurprisedByJumpingDeer.
5) Process EventSurprisedByJumpingDeer in state StateRiflePointsForward: StateRiflePointsForward + EventSurprisedByJumpingDeer => push EventPullTrigger, stay in StateRiflePointsForward. Enqueue EventPullTrigger. New state is StateRiflePointsForward.
6) Pop event EventCheckMuzzle. Process EventCheckMuzzle in state StateRiflePointsForward: StateRiflePointsForward + EventCheckMuzzle => ActionCheckBarrel, goto StateLookingInBarrel. ActionCheckBarrel => barrel is clean, push EventBarrelLooksClean. New state is StateLookingInBarrel.
7) Pop event EventPullTrigger. Process EventPullTrigger in state StateLookingInBarrel: StateLookingInBarrel + EventPullTrigger => ActionBlowYourHeadOff, goto StateYouAreDead.

Uh-oh, not what you would have expected looking at the FSM. The problem is that you cannot "have a feel" for what happens unless you count exactly how many events are pushed as a result of a single external event, and consider all possible permutations of those pushed events.
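The rifle FSM can be run both ways in a small Python sketch (a toy model; event and state names are shortened from the example above, and "BarrelClean" stands for what ActionCheckBarrel pushes when the barrel happens to be clean):

```python
# Toy model of the rifle FSM, run with a single FIFO queue and with
# chained-events-first ordering, from the same starting situation.
TRANSITIONS = {
    # (state, event): (to-state, events pushed by the action)
    ("RifleForward", "CheckMuzzle"): ("LookingInBarrel", ["BarrelClean"]),
    ("LookingInBarrel", "BarrelClean"): ("RifleForward", []),
    ("RifleForward", "SurprisedByDeer"): ("RifleForward", ["PullTrigger"]),
    ("LookingInBarrel", "SurprisedByDeer"): ("LookingInBarrel", []),
    ("RifleForward", "PullTrigger"): ("RifleForward", []),   # fire into the woods
    ("LookingInBarrel", "PullTrigger"): ("Dead", []),        # blow your head off
}

def final_state(state, events, chained_first):
    external, chained = list(events), []
    while external or chained:
        if chained_first:
            # pushed events are processed before external ones
            event = chained.pop(0) if chained else external.pop(0)
        else:
            # single queue: pushed events go behind queued external events
            external.extend(chained)
            chained = []
            event = external.pop(0)
        if (state, event) not in TRANSITIONS:
            break
        state, pushed = TRANSITIONS[(state, event)]
        chained.extend(pushed)
    return state

# A deer jumps out just before the muzzle check, both events get queued:
print(final_state("RifleForward", ["SurprisedByDeer", "CheckMuzzle"],
                  chained_first=False))   # → Dead
print(final_state("RifleForward", ["SurprisedByDeer", "CheckMuzzle"],
                  chained_first=True))    # → RifleForward
```

Same transition table, same external events; only the queue discipline differs, and only the single-queue discipline kills you.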

Note that this problem would not occur if the FSM framework allowed atomic transitions with if statements:

StateRiflePointsForward + EventCheckMuzzle => if barrel is not clean, ActionCleanBarrel; stay in StateRiflePointsForward
StateRiflePointsForward + EventSurprisedByJumpingDeer => ActionFireIntoTheWoods, stay in StateRiflePointsForward

ActionCleanBarrel: clean barrel of rifle, no pushed events
ActionFireIntoTheWoods: safely fire into the woods, no harm done, no pushed events

Note that this FSM is much simpler (and safer ;-)

CONCLUSION

1) As far as I can tell, my approach of using two queues results in exactly the same externally observable behavior as your approach of completely processing each event in the FSM immediately when it occurs. I am still not convinced that my implementation choice of prioritizing chained events over external events will lead to incorrect behavior or interoperability problems.

2) Furthermore, I argue that completely processing all events in the FSM immediately when they occur is highly inefficient in a multi-threaded environment, because it requires blocking locks / semaphores that severely single-thread the implementation.

3) Furthermore, there are real-life implementations of massively multi-threaded FSM frameworks (such as gen_fsm in Erlang OTP) that do in fact use queueing for events.

4) Furthermore, those existing frameworks don't require pushing events in action functions. Instead, they allow all actions for a single event to be executed atomically. This is true even if the to-state depends on some if-then-else statement in one of the actions. This would be my preferred solution to the whole discussion (as I already mentioned in my very first set of comments in the first review).

5) For those reasons, I chose to stay with my current implementation.

6) I will reword the text in the deviations.md document to explain what we have discussed. I will not imply that your approach is wrong in any way. I will simply describe it as a different approach with its own pros and cons (the multi-threading aspect is a con in my opinion).


przygienda commented 6 years ago

Hey Bruno, partially processed the stuff (it's a lot, thanks for taking the time). Answers to the last points, after thinking some:

1) I do think that your "prioritize pushed events before queued" will result in the same behavior as what I describe as "process external immediately and queue only pushed events".
2) I see your point on the multi-threaded code feeding the FSM queue. Unless your FSM transitions take a long time, I don't see that FSM processing is critically blocking anything much. Yes, if you have tons of async producers feeding one single FSM that takes long transitions, that's a problem. Not something I saw in routing protocols.
3) yupp
4) Yes, and in my experience those FSMs become a complete bog of a mess when implemented. You simply never know WHERE the machine ends up, since a state can bifurcate. One of the symptoms is that you can't really draw a picture of such an NFA (https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton), and with that maintainability goes down like a rock. The ultimate state is a one- or two-state "pseudo FSM" with a spaghetti of ifs and case statements for every event checking which state you're in ...
5) ack, I pushed a version saying very clearly "the implementation must behave equivalently to the FSM presented but is in no way bound to implement such FSMs". Anything else would be very bad style, a.k.a. overspecification, anyway.
6) yupp, sounds good, sorry my initial reaction wasn't as well thought out as it should have been, Bruno ;-)

As to the details of my implementation, I do in fact use a queue per FSM, but I drain it pretty much after every push.
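That drain-after-every-push discipline can be sketched as well (a toy model, reusing the state1/event1 example from earlier in the thread; unknown events are simply ignored here for brevity). With Tony's example it yields the same trace as the chained-events-first approach, which illustrates point (1) above:

```python
# Sketch of "queue per FSM, drained after every push": every externally
# generated event kicks the FSM immediately, so only events pushed by
# transitions can ever sit behind it in the queue.
class DrainingFsm:
    def __init__(self, state, transitions):
        self.state = state
        self.transitions = transitions
        self.queue = []

    def push(self, event):
        self.queue.append(event)
        self._drain()                    # kick the FSM right away

    def _drain(self):
        while self.queue:
            event = self.queue.pop(0)
            # Stay in the same state on an unknown (state, event) pair.
            self.state, pushed = self.transitions.get(
                (self.state, event), (self.state, []))
            self.queue.extend(pushed)    # events pushed by the transition

TRANSITIONS = {
    ("State1", "Event1"): ("State2", ["Event3"]),
    ("State2", "Event2"): ("State3", []),
    ("State2", "Event3"): ("State1", []),
}
fsm = DrainingFsm("State1", TRANSITIONS)
fsm.push("Event1")    # Event1 and its chained Event3 are processed at once
print(fsm.state)
# → State1
```

Because the queue is empty whenever an external event arrives, the only reordering question (pushed vs. external) never comes up.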

przygienda commented 6 years ago

Ah, and to come completely clean: the FSM I give is not really a DFA either ;-) since you can use ifs to push different event sequences on exit, so it's also an NFA in the strict sense ;-) Constructing a true DFA without extended state leads, however, to state explosion, as we know, and is not practical in an implementation sense. I prefer the style I chose, since I like pictures, I guess, and I don't mind walking states as long as I don't repeat the code ;-) whereas the "can end up in any end state on transition" style produced code that no one dared to touch anymore, since the logic was impossible to understand and no one knew how to test it anymore (been there more than once) ...