Trickle implementation with regards to redundancy constant

michaelwgnr commented 7 years ago

Hi there,

I experimented a bit with openMesh and also re-read the trickle RFC several times and I stumbled over something:

According to RFC 6206 - "The Trickle Algorithm", Section 4.2 (emphasis mine):

When an interval begins, Trickle resets c to 0 and sets t to a random point in the interval, taken from the range [I/2, I), that is, values greater than or equal to I/2 and less than I. The interval ends at I.

Whenever Trickle hears a transmission that is "consistent", it increments the counter c.

At time t, Trickle transmits if and only if the counter c is less than the redundancy constant k.

When the interval I expires, Trickle doubles the interval length. If this new interval length would be longer than the time specified by Imax, Trickle sets the interval length I to be the time specified by Imax.

If Trickle hears a transmission that is "inconsistent" and I is greater than Imin, it resets the Trickle timer. To reset the timer, Trickle sets I to Imin and starts a new interval as in step 2. If I is equal to Imin when Trickle hears an "inconsistent" transmission, Trickle does nothing. Trickle can also reset its timer in response to external "events".

So my understanding of the interplay between counter c and consistency constant k is that as soon as a message is "consistent enough", transmission of the message should stop. The counter will only be reset upon receiving an inconsistent transmission.

When working with open mesh I observed that the nodes continue with their transmissions infinitely. Looking at the problem more thoroughly, I realized that the counter will be reset in the check_interval function in trickle.c, whenever the time of the call is equal to or greater than I (or trickle->i in the code).

check_interval is called on reception of a consistent(!) transmission or - basicallay - whenever vh_order_update in version_handler.c is called; which seems to be about every time any handle is received or needs to be transmitted (because of new value or timeout).

So to me it looks like the openMesh implementation of trickle does not adhere to the algorithm, because it resets the counter c at other times than specified in the algorithm. Point 6 explicitly goes back to step 2, which is the only time I understand it can happen that c is reset.

Is there a specific reason, why the counter is reset at these times or is this a bug?

Thanks for any help on this.

michaelwgnr commented 7 years ago

I would be glad if someone could answer this!

Thanks!

trond-snekvik commented 7 years ago

Hi, sorry for not responding earlier.

Your observations on the check_interval() function are correct, this function is called whenever there's any update to a trickle instance. This is to avoid having to run a separate timer for each trickle instance: Instead of executing rule 2 at the exact time of the interval expiry, we do it whenever we do the first operation on a trickle instance after the timer has expired. This saves us some ~20 bytes of memory per trickle instance and a lot of CPU cycles, but obfuscates the procedure some. So yes, check_interval() executes rule 2, but only after the interval has expired.

I think the confusion comes from differing interpretations of the Trickle RFC. In the OpenMesh, we want to broadcast values forever (but at exponentially increasing intervals). This is to let devices that enter the network while it's running to come up to speed on the state of the different handles. We interpret the C value as a per-transmit value, as rule 2 resets it every interval. As mentioned, this reset is done at the first opportunity after the interval has expired.

Please let me know if I misunderstood any parts of your question, or if there still are problems with the execution of this procedure. I'll work on my response time! :smiley:

michaelwgnr commented 7 years ago

Hi Trond

Thanks for the detailed response. I'm still a bit unclear on the details here. If I understand you correctly, there are two issues at play here. I think I understand the first one alright, but am unsure about the second:

The execution of rule 2 in check_interval() happens any time there is an update to a trickle instance (i.e. any trickle instance becomes "inconsistent"). This is a deliberate design choice to save memory and CPU cycles because it reduces the number of timers that need to be running.
You specifically want the mesh to rebroadcast handles infinitely to get new nodes up to speed on all the handles being rebroadcast. This happens implicitely because rule 2 is executed on any update to any handle.

Is this right?

I assume this leads to potentially heavily increased traffic, though: Suppose there are more active handles in the mesh than there is space in the handle cache:

Handle 0xCAFE drops out of the handle cache of node A
the same handle has the persistent flag on node B
node B will at some point rebroadcast 0xCAFE
node A will receive 0xCAFE and start rebroadcasting it at the I_min interval.

Did you experience anything like that? Or do you generally suggest to use the mesh with a low number of handles or rather a handle cache sized to hold all the handles?

Sorry for grinding on this, but it seems for most of my potential use cases, it would be preferable to adhere to trickle as I interpreted it and stop rebroadcasting as soon as a certain consistency is reached, so new values will be quickly and "reliably" flooded through the network, but the network remains idle for most of the time, no matter how many handles are in use. Therefore I wonder if the behaviour would be modifiable to still have the memory and CPU savings, but only reset C when the affected handle receives an update?

trond-snekvik commented 7 years ago

If we add this to your first point, I agree:

The execution of rule 2 in check_interval() happens any time there is an update to a trickle instance, if the interval has expired..

The second point is correct, and so is your observation on the increased traffic. If a handle keeps dropping out of the cache between repeats from a neighbor, the device will start transmitting it again and again. This would balance out a bit by the fact that the cache removes entries on a least recently used basis, but it doesn't remove the problem all-together.

I think @victorpasse wanted similar behavior as you about a year ago, and I gave him this answer. Could this approach work for your use cases?

NordicPlayground / nRF51-ble-bcast-mesh

Trickle implementation with regards to redundancy constant #168