contiki-os / contiki

The official git repository for Contiki, the open source OS for the Internet of Things
http://www.contiki-os.org/

CC1310 contikimac + prop-mode causes reboot under heavy traffic #1933

Open andrewbrannan opened 7 years ago

andrewbrannan commented 7 years ago

I've observed an issue with the CC1310 while using Contikimac that causes a device to randomly reboot under heavy traffic. The issue only exists when there is more than 1 node on a network and occurrences increase with the number of nodes on a network. I'm able to reproduce it as follows:

The problem seems to be contikimac specific. I've tracked it down to the on() function of prop-mode.c being called by contikimac while it's receiving a burst. After placing some strategic debug printouts, I've figured out the sequence of events that causes the reboots.

  1. Contikimac calls prop-mode.c's on() from line 967 (if(we_are_receiving_burst) { on();.....)
  2. on() gets to somewhere between line 877 (oscillators_request_hf_xosc();) and line 912 (oscillators_switch_to_hf_xosc();).
  3. The execution pauses because prop-mode.c's on() is called again from somewhere else. (!!!) Since it is able to preempt execution, the second call must come from an interrupt or an rtimer, but I don't know enough about the contikimac protocol to tell exactly where it comes from.
  4. The second call to on() executes all the way through successfully.
  5. The first call to on() resumes from wherever it was interrupted.
  6. The first call to on() makes it to line 912 (oscillators_switch_to_hf_xosc();) which has already been called a moment before by the second call to on().
  7. The device reboots.

I'm able to prevent the issue entirely by placing a ti_lib_int_master_disable(); in on() before requesting the hf oscillators and then the corresponding ti_lib_int_master_enable(); right before the switch. This feels like a hack though.
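A minimal sketch of the workaround described above, written as a host-side simulation rather than real driver code. Every `sim_*` name is hypothetical: `sim_int_master_disable()` and `sim_int_master_enable()` stand in for `ti_lib_int_master_disable()` and `ti_lib_int_master_enable()`, and the comments mark where the two oscillator calls would sit. The sketch only shows that masking interrupts defers the nested call until after the protected window; it does not model the reboot itself.

```c
#include <stdbool.h>

/* Simulated interrupt state. All names here are hypothetical stand-ins. */
static bool sim_irq_enabled = true;
static bool sim_irq_pending = false;
static bool sim_in_window = false;   /* between XOSC request and re-enable */
static int sim_isr_in_window = 0;    /* ISR runs that landed in the window */
static int sim_isr_count = 0;        /* ISR runs overall */

/* Stands in for the rtimer interrupt that ends up calling on() again. */
static void sim_isr(void)
{
  sim_isr_count++;
  if(sim_in_window) {
    sim_isr_in_window++;
  }
}

static void sim_int_master_disable(void)
{
  sim_irq_enabled = false;
}

static void sim_int_master_enable(void)
{
  sim_irq_enabled = true;
  if(sim_irq_pending) {              /* deliver the deferred interrupt */
    sim_irq_pending = false;
    sim_isr();
  }
}

/* The moment at which the interrupt fires while the XOSC is starting up. */
static void sim_preemption_point(void)
{
  if(sim_irq_enabled) {
    sim_isr();
  } else {
    sim_irq_pending = true;
  }
}

/* Models the driver's on() with or without the interrupt guard. */
void sim_radio_on(bool use_guard)
{
  if(use_guard) {
    sim_int_master_disable();
  }
  sim_in_window = true;      /* oscillators_request_hf_xosc() would run here */
  sim_preemption_point();
  sim_in_window = false;
  if(use_guard) {
    sim_int_master_enable(); /* the deferred ISR runs now, before the switch */
  }
  /* oscillators_switch_to_hf_xosc() would run here */
}

/* Resets the simulation, runs on() once, returns ISR runs inside the window. */
int sim_run(bool use_guard)
{
  sim_irq_enabled = true;
  sim_irq_pending = false;
  sim_in_window = false;
  sim_isr_in_window = 0;
  sim_isr_count = 0;
  sim_radio_on(use_guard);
  return sim_isr_in_window;
}

int sim_total_isr_runs(void)
{
  return sim_isr_count;
}
```

With the guard in place the interrupt is not lost, only delayed: it still runs once, but after the request phase has finished instead of in the middle of it.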

It'd be great if somebody with a bit more familiarity with contikimac could have a look at this. It seems like it's just a case of an interrupt or rtimer not being suppressed when the radio is already being turned on.

Wonder if this could also be related to #1878

g-oikonomou commented 7 years ago

Thanks for the detailed report.

If on() gets called at line 967, then the value of we_are_receiving_burst should be non-zero.

An rtimer interrupt will execute powercycle, which yields at various points. If it yields at L433, then the next time it gets called we_are_receiving_burst will be non-zero, so it will yield again. I think this is not where on() gets called for the 2nd time. powercycle also yields at L483 and L514, so perhaps something there.

To switch the HF clock source to the HF XOSC, first we place a request for the XOSC to start up (oscillators_request_hf_xosc()). While it's starting up we can do other things and, when it's ready, we perform the actual (blocking) switch (oscillators_switch_to_hf_xosc()). Multiple requests to switch are OK as long as the XOSC is ready. So in theory the 2nd on() (the one called from the interrupt) should work.

However, if within the interrupt context the code selects the RC as the HF clock source, the XOSC will be allowed to power down. This can happen if e.g. off() also gets called within the same interrupt. If this happens, then the clock source switch in the 1st on() will never complete (because the XOSC is left powered down). My gut feeling is that within the interrupt context off() also gets called. Can you double check please?
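The two scenarios above can be replayed with a small state model. This is a sketch under stated assumptions, not the real oscillator driver: all `model_*` names are hypothetical, a requested XOSC is treated as immediately ready, and the blocking switch returns `false` at the point where the real call would wait indefinitely.

```c
#include <stdbool.h>

typedef enum { HF_SRC_RC, HF_SRC_XOSC } hf_src_t;

/* Hypothetical model of the HF clock state; not the real driver state. */
typedef struct {
  bool xosc_requested;   /* true while the XOSC is kept powered */
  hf_src_t hf_src;       /* currently selected HF clock source */
} osc_model_t;

/* Models the request phase: ask the XOSC to start up. */
void model_request_hf_xosc(osc_model_t *m)
{
  m->xosc_requested = true;
}

/* Models the blocking switch. Returns false where the real call would
 * never complete because the XOSC has been powered down. */
bool model_switch_to_hf_xosc(osc_model_t *m)
{
  if(!m->xosc_requested) {
    return false;
  }
  m->hf_src = HF_SRC_XOSC;
  return true;
}

/* Models off(): select the RC oscillator, letting the XOSC power down. */
void model_switch_to_hf_rc(osc_model_t *m)
{
  m->hf_src = HF_SRC_RC;
  m->xosc_requested = false;
}

/* Replays the reported sequence: the first on() is preempted between its
 * request and its switch by an interrupt that runs a complete on() and,
 * optionally, an off(). Returns whether the first on() can finish. */
bool model_replay(bool isr_calls_off)
{
  osc_model_t m = { false, HF_SRC_RC };

  model_request_hf_xosc(&m);          /* first on(): request phase */

  model_request_hf_xosc(&m);          /* interrupt: nested on() ... */
  (void)model_switch_to_hf_xosc(&m);  /* ... runs to completion */
  if(isr_calls_off) {
    model_switch_to_hf_rc(&m);        /* interrupt: off() as well */
  }

  return model_switch_to_hf_xosc(&m); /* first on() resumes at the switch */
}
```

In this model `model_replay(false)` succeeds, matching the point that repeated switch requests are harmless while the XOSC is ready, and `model_replay(true)` fails, matching the suspected off() inside the interrupt.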

g-oikonomou commented 7 years ago

Having said all that, ContikiMAC tries to avoid calling NETSTACK_RADIO.on(); twice in a nested fashion. Look at ContikiMAC's on() function. A second call to it will return early, because radio_is_on will be 1. This makes me think that perhaps in the interrupt context ContikiMAC attempts to call NETSTACK_RADIO.channel_clear() (which will in turn call on() within the radio driver).
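A rough illustration of why the radio_is_on guard would not cover that path. This is a simulation with hypothetical names (`mac_on`, `drv_on`, `drv_channel_clear`), and it assumes the MAC-level flag is set before the driver call and that the driver's channel_clear turns the radio on itself, as described above.

```c
#include <stdbool.h>

/* Hypothetical simulation state; none of this is real Contiki code. */
static int drv_depth = 0;             /* current nesting inside drv_on() */
static int drv_max_depth = 0;         /* deepest nesting seen so far */
static bool mac_radio_is_on = false;  /* models ContikiMAC's radio_is_on */
static void (*drv_hook)(void) = 0;    /* simulated interrupt, fires once */

/* Models the radio driver's on(); the hook preempts it part-way through. */
static void drv_on(void)
{
  void (*hook)(void) = drv_hook;

  drv_depth++;
  if(drv_depth > drv_max_depth) {
    drv_max_depth = drv_depth;
  }
  if(hook) {
    drv_hook = 0;
    hook();
  }
  drv_depth--;
}

/* Models the driver's channel_clear(), which turns the radio on itself. */
static void drv_channel_clear(void)
{
  drv_on();
}

/* Models the MAC-level on() with its early-return guard. */
static void mac_on(void)
{
  if(!mac_radio_is_on) {
    mac_radio_is_on = true;
    drv_on();
  }
}

static void isr_via_mac_on(void)
{
  mac_on();
}

static void isr_via_channel_clear(void)
{
  drv_channel_clear();
}

/* Runs one guarded on() that is preempted once and returns how deeply the
 * driver's on() ended up nested. */
int sim_max_depth(bool isr_uses_channel_clear)
{
  drv_depth = 0;
  drv_max_depth = 0;
  mac_radio_is_on = false;
  drv_hook = isr_uses_channel_clear ? isr_via_channel_clear : isr_via_mac_on;
  mac_on();
  return drv_max_depth;
}
```

When the interrupt goes through the guarded on(), the driver is entered only once; when it goes through channel_clear(), the driver's on() nests to depth 2 because that path never consults the flag.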

Edit: I'm suspecting the channel_clear() call in L455.

mdmobashir commented 7 years ago

Thanks for uploading such a detailed analysis!

I'm facing a similar reboot situation with ContikiMAC and prop-mode, but on the CC1350. Can anyone confirm whether the above hack fixed the issue, or whether there is another solution?

Thanks.

amitbhanja commented 6 years ago

@mdmobashir I have also tried the hack, but it does not seem to fix the issue here. Do you have any updates? I have a similar problem with the CC1350.