contiki-os / contiki

The official git repository for Contiki, the open source OS for the Internet of Things
http://www.contiki-os.org/

Issue with Rime regression test 06-sky-trickle #818

Open ejoerns opened 10 years ago

ejoerns commented 10 years ago

As can easily be observed in the currently failing Travis tests (e.g. see [1]), there seems to be an issue with Rime, the radio, or the Cooja simulation. The test fails because not all nodes receive their messages. Does anyone already have a solution for this (apparently randomly appearing) issue?

When I tested this on my PC, everything worked fine. Testing on my Travis account, however, produced a new failure. After playing around for a while, I found the problem appearing and disappearing from time to time, seemingly at random. My first thought was that it could be connected to the simulation seed or something similar, but I have not found any correlation.

My current guess is that it is both an issue with the reproducibility of simulations and with the radio medium simulation in Cooja. I will try to spend more time tracking down the issue, but would like to make sure that I am not hunting a problem that is already well known and maybe solved in some branch :).

[1] https://travis-ci.org/contiki-os/contiki/jobs/37695022

cmorty commented 10 years ago

Yes, there is a regression; I am currently trying to tackle it (https://github.com/contiki-os/contiki/pull/810). Interestingly, my test case which failed yesterday is working today. No idea why. I'm running some experiments to find out whether this is an issue with Cooja or Contiki. As for reproducibility in Cooja: Cooja should be reproducible.

Morty


ejoerns commented 10 years ago

Yes, that's exactly what I observed. Nearly all tests failed yesterday, some tests today, and approximately every tenth run fails on my local machine... I spent some time on it yesterday. It seems that some messages are not delivered to neighboring nodes even though they are in transmission range. And, looking at the RadioLogger, the number of messages sent seems a bit too high... It could be an issue with the medium or the MSPSim radio implementation. It is really annoying for testing that the problem occurs randomly even with constant seeds... I don't know how closely the transmission issue and the reproducibility issue are connected.

cmorty commented 10 years ago

Hey,

as all the simulation code is Java, it should be deterministic. I checked the code and could not find any obvious concurrency issues; there might be some, though. Nonetheless, with https://github.com/contiki-os/contiki/pull/821 you should now be able to verify that you at least have the same binaries, so that the problem must be on the Java side.

Morty


ejoerns commented 9 years ago

Well, I've spent some hours investigating what is really causing this issue; here is my result:

First of all: it is not really a bug, it's an algorithmic feature.

To illustrate the problem, I've attached this Cooja screenshot:

[Cooja screenshot: reason-for-issue]

I discovered that in most cases nodes 5, 8, and 9 do not receive messages. Now if you take a look at node 6, you can see the simulation's bottleneck. This node is the only one in transmission range of node 8 (this is not really obvious from the picture, but believe me ;) ), i.e. only node 6 can send messages to node 8.

What you can see more clearly from the picture is that node 6 itself has a lot of neighbors (nodes 7, 4, 2, 10, 3, and of course 8).

The rest is a bit of implementation detail of the Rime trickle-broadcast algorithm. In short: if a node receives a new broadcast message (i.e. one with a new sequence number), it pushes the message to the upper layers and then waits for a certain interval. If no further copy of the message is received during a random fraction of this interval, it broadcasts the message itself to all its neighbor nodes, increases the interval, and waits again.

Thus, if one or more other nodes have already broadcast the message during the current waiting interval, the waiting node will not broadcast the message itself for this interval (see the sketch below).
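To make the mechanism concrete, here is a minimal sketch of that suppression rule. This is not the actual Contiki code (the real implementation lives in Rime's trickle module); the names `trickle_state`, `on_receive`, `on_timer` and both constants are mine, chosen only to mirror the behavior described above:

```c
#include <stdint.h>

#define INITIAL_INTERVAL      128  /* hypothetical initial interval, in clock ticks */
#define SUPPRESSION_THRESHOLD 1    /* one heard duplicate already suppresses us */

struct trickle_state {
  uint8_t  seqno;       /* sequence number of the message we are holding */
  uint8_t  duplicates;  /* duplicates heard during the current interval  */
  uint32_t interval;    /* current interval length                       */
};

/* Placeholder: hand the held packet back to the broadcast layer. */
static void rebroadcast_current_message(void) { }

/* Called whenever a broadcast with sequence number 'seqno' is heard. */
static void on_receive(struct trickle_state *s, uint8_t seqno)
{
  if(seqno == s->seqno) {
    s->duplicates++;               /* same message again: count the duplicate */
  } else {
    s->seqno = seqno;              /* new message: deliver it upward...       */
    s->duplicates = 0;
    s->interval = INITIAL_INTERVAL;
    /* ...and schedule on_timer() at a random point within the interval */
  }
}

/* Called when the randomly chosen point within the interval is reached. */
static void on_timer(struct trickle_state *s)
{
  if(s->duplicates < SUPPRESSION_THRESHOLD) {
    rebroadcast_current_message(); /* nobody beat us to it: rebroadcast */
  }
  s->duplicates = 0;
  s->interval *= 2;                /* back off and wait again */
  /* reschedule on_timer() within the new, longer interval */
}
```

With `SUPPRESSION_THRESHOLD` at 1, a single overheard duplicate is enough to cancel a node's own transmission for the whole interval, which is exactly what bites node 6 below.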

Now the node setup reveals the crucial point: node 6 receives a lot of messages from its neighbors, so it may hear many duplicates of the same sequence number during its waiting interval. Since the threshold is hard-coded such that a single duplicate already prevents node 6 from forwarding the message itself, it becomes clear that this bottleneck is prone to causing long delays, depending on the current random seed...

Thus, in the current configuration, every broadcast message should still arrive at every node even under bad circumstances, but potentially only after very long delays.

Possible 'solutions' (apart from simply changing the seed for all simulations ;) ):

P.S.: for those who are interested, here is my exported jar that allows reproducing the issue: http://share.ejoerns.de/debug_issue818_simulation.jar

cmorty commented 9 years ago

@ejoerns How did you manage to reproduce that? What troubles me is that Travis' Cooja comes to a different result than my Cooja, although the binaries are the same (I think we can trust SHA-1 :) ). As far as I understand, Cooja should be deterministic, and if it is not, we should fix that.

ejoerns commented 9 years ago

@cmorty Yes, I was able to reproduce it. Did you use a fixed random seed? The major trouble I had reproducing this was that minimal changes to the mote's C code completely changed the simulation result. But this should be OK, as it changes execution times. After each code change I ran the test with a range of random seeds and further investigated those runs that timed out.

cmorty commented 9 years ago

@ejoerns Yes, the regression tests use a fixed random seed. Also, as already noted, I used SHA-1 hashes to verify that the binaries are identical -> #821

ejoerns commented 9 years ago

@cmorty You're right, I did not take into consideration that the regression tests always overwrite the seed specified in the .csc file. Hm... from that point of view I don't really have an idea why the test results differ between hosts. It could possibly be something like diverging PRNG implementations, or cosmic rays...

cmorty commented 9 years ago

@ejoerns: Well, according to the Java documentation the PRNG should be well defined, and #822 did not reveal any problems either. But maybe I overlooked something. Concurrency is a PITA, after all.
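For reference, the Javadoc does pin java.util.Random down completely: it is a 48-bit linear congruential generator with fixed constants, so a given seed must produce the identical sequence on every conforming JVM. As a quick illustration, here is a standalone C re-implementation of its next(32) step (the constants come from the Java specification; this says nothing about Cooja itself, only that the generator cannot be the source of cross-host divergence):

```c
#include <stdint.h>
#include <stdio.h>

static uint64_t state;

/* Seed scrambling as specified for java.util.Random.setSeed(). */
static void java_set_seed(int64_t seed)
{
  state = ((uint64_t)seed ^ 0x5DEECE66DULL) & ((1ULL << 48) - 1);
}

/* next(32) as specified in the Javadoc: one LCG step, top 32 of 48 state bits. */
static int32_t java_next_int(void)
{
  state = (state * 0x5DEECE66DULL + 0xBULL) & ((1ULL << 48) - 1);
  return (int32_t)(state >> 16);
}

int main(void)
{
  java_set_seed(123456);
  for(int i = 0; i < 5; i++) {
    /* should match: new java.util.Random(123456).nextInt() on any JVM */
    printf("%d\n", java_next_int());
  }
  return 0;
}
```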

g-oikonomou commented 9 years ago

Chaps, where are we with this one? Can it be closed?