Network Stability - Githubissues

chrislstewart commented 7 years ago

The subscriber client script subscribe.py occasionally times out and stops logging data without raising a fatal error. This has been observed to be preceded by the LED on the CC2531 USB dongle going through the same pattern of flashes as when the network is first being established. After subscribe.py is interrupted via Ctrl+C in the terminal, it fails to reconnect to the network, throwing “Error 113: No Route To Host.” Any active canaries are able to reconnect. A reboot of the computer hosting the network is necessary to reconnect subscribe.py.

steelsmithj commented 7 years ago

Thank you very much for the detailed report on the error that was occurring. The report makes it sound like there might be a problem with traffic going through the USB dongle, which then causes problems with the subscribe.py script. However, having the devices still registered on the bbbb::100 webpage suggests otherwise.

After the socket error, does the "last seen" column on the [bbbb::100]/sensors page keep increasing and not reset? If so, there is a problem with 6lbr and the dongle handling the traffic. If not, the problem could be with the MQTT broker.

Also, what is your '#define NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE' set to? It is located in https://github.com/PureEngineering/contiki/blob/master/examples/canary/mqtt_protobuf_demo/project-conf.h

Increasing this by powers of 2 (8, 16, 32, 64 max) will greatly help increase the traffic through the mesh, at the cost of power consumption. As you increase the number of devices, you will probably have to raise this value.

chrislstewart commented 7 years ago

Jon,

During what we believe to be the crash and automatic recovery of the 6lbr/eth0/br0 framework, the page at http://[bbbb::100] disappears (page not found error). When it comes back, the sensors are mostly gone from the listing at [bbbb::100]/sensors (usually one or two have reconnected by the time refreshing brings up the page), though they re-register automatically over the next minute or so.

After the socket error, the canaries return to just (what I assume is) a basic check-in once per minute; the "last seen" does reset accordingly. It is the consistent near-simultaneous recovery of the network framework and timeout of the subscribe.py client in mosquitto that lead us to believe that the

In /contiki/examples/canary/mqtt_protobuf_demo/project-conf.h, line 47 reads:

#define NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE 8

I'll try increasing this value and see if the many-canary network setups are more resilient. Does this also affect the rate at which canaries publish their own sensor readings? Prior to the firmware update, each canary would publish a data point approximately once per second; post-update that dropped to about once per 10 seconds. It would be helpful to be able to tune that frequency for certain deployment scenarios.

I'm copying Dr. Goldblum on the thread so she can independently offer information and/or ask questions relevant to the issue. Let me know if there's any other information that would aid with diagnostics.

Thanks, Chris


steelsmithj commented 7 years ago

In a previous email it was stated that the VM did not contain these network errors. Is this true for the hardware and the VM together, or just the simulator?

quantumqueen commented 7 years ago

It was only tested with the simulator. However, I think we may have found the issue. The subscribe.py script uses the following calls:

mClient.subscribe("c", 0)
mClient.loop_forever()

As the mqtt unsubscribe function is never called, we never fully terminate the loop in the case of an interruption and any child processes that are running may prevent re-subscription without a reboot. We are currently working on modifying the subscription to allow for these corner cases and will test to see if this improves the stability.
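A minimal sketch of the kind of cleanup described above, assuming `mClient` is a paho-mqtt style client exposing `subscribe`, `loop_forever`, `unsubscribe`, and `disconnect`. The wrapper name `run_subscriber` is hypothetical and is not taken from subscribe.py, and whether this by itself avoids the reboot has not been tested here.

```python
def run_subscriber(client, topic="c", qos=0):
    """Subscribe, block in the network loop, and always clean up on exit.

    `client` is assumed to behave like a paho-mqtt Client; the function
    name and defaults are illustrative only.
    """
    client.subscribe(topic, qos)
    try:
        client.loop_forever()
    except KeyboardInterrupt:
        # Ctrl+C in the terminal: fall through to the cleanup below.
        pass
    finally:
        # Tell the broker we are going away instead of just dying.
        client.unsubscribe(topic)
        client.disconnect()
```

With this shape, an interrupt still ends the script, but the broker sees an explicit unsubscribe and disconnect first.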

Is the data rate set by NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE or MSG_INTERVAL?


steelsmithj commented 7 years ago

NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE controls the "wake up" period in the radio duty cycle protocol. To conserve power, each node turns its radio on and off very quickly and checks whether there is an incoming message. Increasing the check rate makes it turn on and off more often, giving a higher on rate. This makes messages travel through the mesh more reliably, but at the cost of holding the radio on more. As more nodes are added to the mesh, this value will need to increase to make up for the higher amount of data.
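For a rough sense of scale, here is a small sketch that converts a check rate into the time between radio wake-ups. It assumes the value is expressed in wake-ups per second, which the thread does not state explicitly, and it accepts only the values listed earlier in the thread (8, 16, 32, 64). The function name is hypothetical.

```python
# Values mentioned in this thread; treated here as wake-ups per second
# (an assumption, not something the thread confirms).
ALLOWED_CHECK_RATES = (8, 16, 32, 64)


def wake_interval_ms(check_rate):
    """Return the time between radio wake-ups in milliseconds."""
    if check_rate not in ALLOWED_CHECK_RATES:
        raise ValueError("check rate must be one of %r" % (ALLOWED_CHECK_RATES,))
    return 1000.0 / check_rate


if __name__ == "__main__":
    for rate in ALLOWED_CHECK_RATES:
        print(rate, "->", wake_interval_ms(rate), "ms between wake-ups")
```

Under that assumption, going from 8 to 64 shortens the gap between wake-ups by a factor of eight, which is where the extra power draw comes from.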

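#define CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL (CLOCK_SECOND * 10)

Located in https://github.com/PureEngineering/contiki/blob/master/examples/canary/mqtt_protobuf_demo/cc26xx-web-demo.h, this controls how often a message is published over MQTT. This rate cannot be faster than once per second without changing core Contiki code.

#define SENSOR_READING_PERIOD (CLOCK_SECOND)

Located in https://github.com/PureEngineering/contiki/blob/master/examples/canary/mqtt_protobuf_demo/cc26xx-web-demo.c, this controls how fast the sensors are polled. Setting this larger than your publish interval will result in duplicate information.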

chrislstewart commented 7 years ago

Jon,

I’ve modified two of the three values you mentioned (specifics below) in the firmware source, re-compiled it, and flashed the image onto all of our canaries. In addition, I wrote a couple of functions into subscribe.py that handle network setup and trigger an automatic rebuild of the network when an on_disconnect callback is received.

SENSOR_READING_PERIOD: kept equal to (CLOCK_SECOND)

CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL: decreased from (CLOCK_SECOND * 10) to (CLOCK_SECOND)

My understanding of these settings was that the sensors were (and still are) being read approximately once per second, but that only every 10th reading was being published. By decreasing the publish interval to the same as the sensor reading period, we are now publishing every sensor reading.

NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE: increased from 8 to 64 with no apparent change in stability as a function of number of devices connected.
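For reference, a rough sketch of an on_disconnect handler that rebuilds and then reconnects. This is not the code added to subscribe.py. It assumes a paho-mqtt 1.x style callback signature `(client, userdata, rc)` and a client with a `reconnect()` method; `rebuild_network` is a hypothetical hook standing in for whatever brings the 6lbr/bridge setup back up.

```python
import time


def make_on_disconnect(rebuild_network, max_attempts=5, delay_s=2.0,
                       sleep=time.sleep):
    """Build an on_disconnect callback that rebuilds, then reconnects.

    `rebuild_network` is a caller-supplied function (hypothetical here).
    `sleep` is injectable so the retry logic can be exercised quickly.
    """
    def on_disconnect(client, userdata, rc):
        if rc == 0:
            # rc == 0 means we asked for the disconnect; nothing to repair.
            return True
        rebuild_network()
        for attempt in range(1, max_attempts + 1):
            try:
                client.reconnect()
                return True
            except OSError:
                # e.g. errno 113 (no route to host): wait a bit longer
                # after each failed attempt, then try again.
                sleep(delay_s * attempt)
        return False

    return on_disconnect
```

It would be attached with something like `mClient.on_disconnect = make_on_disconnect(my_rebuild)`, where `my_rebuild` is your own setup function.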

My earlier notion (that the border router software being run off the dongle antenna was the point of failure) seems supported by the lack of improvement in stability with change in the radio toggling frequency on the canaries. A quick google search on stability issues with many connections running through 6lbr yielded multiple high-ranked results on problems people were having when connecting double-digit numbers of ContikiOS devices through 6lbr. At least one thread mentioned having success with Contiki’s rpl-border-router. Do you see any reason why we shouldn’t pursue the use of Contiki’s rpl-border-router, given that the goal is to have (very) many simultaneously connected devices?

Thanks, Chris


steelsmithj commented 7 years ago

CC26XX_WEB_DEMO_DEFAULT_PUBLISH_INTERVAL is the period at which the canary sends a packet through 6lbr to the MQTT broker. It is safe to assume that every sensor reading will be published if both are set to CLOCK_SECOND. They are two separate timers, though, so I guess there is a small possibility that a value is published twice before the next reading, or that the sensor is read twice between publishes.
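To illustrate how two independent timers interact, here is an idealized sketch (not Contiki code). It assumes readings happen at exact multiples of the reading period, that each publish sends whatever reading is most recent at that moment, and that times are whole milliseconds. All names are hypothetical.

```python
def simulate(read_ms, publish_ms, offset_ms, duration_ms):
    """Count duplicate publishes and never-published readings.

    Readings occur at 0, read_ms, 2*read_ms, ...
    Publishes occur at offset_ms, offset_ms + publish_ms, ...
    Returns (duplicates, skipped).
    """
    published = []
    t = offset_ms
    while t < duration_ms:
        # Index of the most recent reading available at time t.
        published.append(t // read_ms)
        t += publish_ms
    if not published:
        return (0, 0)
    distinct = len(set(published))
    duplicates = len(published) - distinct
    # Readings between the first and last published index that never went out.
    skipped = (published[-1] - published[0] + 1) - distinct
    return (duplicates, skipped)


if __name__ == "__main__":
    print(simulate(1000, 1000, 500, 10000))   # both timers at one second
    print(simulate(1000, 10000, 0, 60000))    # publish every ten seconds
    print(simulate(2000, 1000, 0, 10000))     # reading slower than publishing
```

In this model, equal periods publish every reading once, a ten-second publish interval drops most readings, and a reading period longer than the publish interval repeats values.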

Raising NETSTACK_CONF_RDC_CHANNEL_CHECK_RATE improves performance within the mesh, not toward the 6lbr router. As you start spreading out canaries and they begin needing to hop multiple times to get to the router, a higher check rate will improve the performance of the entire mesh.

The dongle does not perform a channel check; it is always receiving. So when a message is sent to it, assuming there are no collisions and it is not already reading another message, it should receive it. This means that the more devices sit a single hop from the dongle, the worse its performance will be.

At a base level, 6lbr is designed to use the rpl-border-router example. RPL is the routing protocol that 6lbr uses. We can try using the rpl-border-router, but we lose a lot of functionality (no [bbbb::101], mqtt won't work as easily, etc).

I think that in order to have very many connected devices, the publish intervals need to be longer, which runs counter to your testing at the moment. The end goal of this mesh network is very fast communication between canaries (high channel check rate), with a single message sent through 6lbr after the canaries have communicated with each other.

PureEngineering / contiki

Network Stability #7
