Radio left on for many hours will fail to send a packet

ChrisAtSamraksh commented 10 years ago

With the demo I have been testing how long we can have our software running with no problems. The code still runs because if the button to our demo is pushed the LCD indicates it. However, a newly rebooted master does not get radio messages from one of the button slaves that has been left on for a long time. Upon rebooting, the slave will send a message.

Nathan-Stohs commented 10 years ago

Chris, would leaving a mote on overnight be sufficient to bring about this issue?

And the master does receive messages even after being left on, right? So this is an issue that seems to be specific to the radio and TX operations, not a more general failure?

ChrisAtSamraksh commented 10 years ago

Leaving a mote on overnight would not (as far as I know) cause problems. I am betting this is a radio only problem because the code is definitely still running (GPIO reading and LCD display is still running in the managed code). It needs to be more fully characterized.

MichaelAtSamraksh commented 10 years ago

Possibly related: I just had the mote get stuck in this endless while loop on line 307 of DeviceCode/Targets/Native/STM32F10x/DeviceCode/drivers/spi/netmf_spi.cpp in function CPU_SPI_ReadWriteByte(...).

while (SPI_I2S_GetFlagStatus(SPI_mod, SPI_I2S_FLAG_RXNE) == RESET);

If we're advertising "Real-Time C#", shouldn't we develop some utility functions to make it easier to enforce timeouts into continuations while waiting for peripheral hardware signals? I realize there's a difference between "real-time" and "fault-masking" but even during execution under non-fault conditions, this and other status flag checks are dependent on hardware properties and are therefore unbounded. back trace for cpu_spi_readwritebyte

Nathan-Stohs commented 10 years ago

Slightly related question... why is the Radio Driver running inside an IRQ handler...?

ChrisAtSamraksh commented 10 years ago

A timer was set up to send out a radio Beacon message every five seconds (for example). When the timer fires it places the callback (beaconScheduler() ) in a tasklet which eventually calls the callback.

I didn't know it was called from an IRQ, I assumed it was part of the scheduler.

This is not just radio code doing this, but anything that uses a timer to call a callback function would behave in this manner.

Nathan-Stohs commented 10 years ago

"but anything that uses a timer to call a callback function would behave in this manner."

No, that is not how continuations and completions (including Timer callbacks) work.

The sequence is

Timer setup from scheduler (acting on behalf of OS or C# managed timer).
Timer fires.
Timer calls DequeueAndExec() (in interrupt context).
DequeueAndExec() tells the scheduler to update its run queue, but it does not actually run anything. Still in interrupt context.
interrupt exits
MF returns to main activity loop. Sees it has items in its run queue. Runs them in user context.

If the tasklet system changes are moving user timer events to interrupt handlers, I would strongly argue that architecture is fundamentally broken.

Nathan-Stohs commented 10 years ago

Or maybe I'm overblowing the issue here, but in this particular case it is way too much to be stuffed into an interrupt.

Nathan-Stohs commented 10 years ago

So if user C# timer callbacks are running like this, that would be very bad. That is originally what I thought but now I think that is not what Chris actually meant.

So the issue then are virtual timer callbacks for other HAL devices (that use the virtual timer?). In this particular case the Radio driver. This is less bad but still broken.

Tasklets were written to be similar to Linux where they have two parts, a High and Low. The High part runs in interrupt context (for time sensitive code only), and the Low part is put on a task queue / deferred processing and run in System context.

What it sounds like is that we only have a High part (interrupt context)***. So by default any timed event runs inside an interrupt. In the case of the radio, this appears to mean that practically speaking the entire driver is running inside an interrupt. That is a recipe for disaster.

So what we need is a Low part implementation, a way to stick a task on a queue rather than running it inside the interrupt. I'm not familiar with how Tasklets were envisioned on MF to being with. Do we have provisions for the Low half? Does the new VirtualTimer have such an ability?

***(I see code that says Tasklet_Low but it also appears to be an interrupt).

MichaelAtSamraksh commented 10 years ago

I agree that this looks too busy for an IRQ handler.. I interpret "way too much stuff" to mean a call chain 12 functions deep with a while loop that can stall waiting on a hardware event.

I agree that the design is broken if one schedules a continuation to avoid spending too much time in an IRQ handler, only to have the continuation do the same thing from another IRQ handler.

My radio problem above is most likely unrelated to the overnight radio death. In porting code from the old MAC API to the new MAC API, nowhere was it documented that one should no longer call CPU_Radio_TurnOn() because csmaMAC.cpp now makes that call (and other problems like that). The main problem was that UnInitialize() functions did not do their job properly at multiple locations. So when I called gcsmaMacObject.UnInitialize() to install an update, CSMAMAC never removed the three HALTimers it set up. And when I implemented csmaMAC::UnInitialize() by making it call gHalTimerManager.StopTimer(...), I found out that causes HALTimerCallback(...) to call a null pointer. And after protecting the null pointer, I found out that HALTimerCallback(...) still does not work because it continues iterating over the maximum number of timers ever allocated so there's a logic bug where its while loop is never satisfied.

So please watch out for DeInit() functions that are not counterparts to their Init() functions.

And please use a singleton pattern with a reference counter if an initialization should only happen once. ... related: Will somebody answer the following question if they already know (so I don't have to look up the data sheet up)? I assume that SPI hardware may be initialized multiple times without doing any harm, right? Because the SD, tinyhal, RF231, LPC2*XX, and Microsoft_SPOT_Hardware_SPI C# all call CPU_SPI_Initialize() at different times and CPU_SPI_Initialize() does not keep track of whether it's already initialized.

Please, will somebody point me to a proper comparison of Tasklets and Continuations? Are Tasklets integrated with the MF Scheduler such that time spent in a Tasklet is paid back to the thread that was supposed to be executing?

mukundansridharan commented 10 years ago

Several issues here. Let me clarify what I know.

C# timers dont run in IRQ or under tasklets. They run under the scheduler.
The tasklets have high and low INTERRUPT queues, this is only about priority. Both their callbacks are executed under software trigged interrrupt IRQs. The reason for this is when it was designed it mainly for virtualizing timers, where the design was supposed to create many "hardware timers". All hardware timers callbacks are executed under IRQs, so are software timers.
Tasklets have been removed from the latest virtual timer design (I thought a majority of us were unhappy with tasklets, and that it does very little for the complexity involved). All it does, at this point, is provide two sets of timers with different priorities. And Nived used only the high-priority timer for all MAC related timers and hence the low-level timers are not used at all.The new vitual timer code can execute its callbacks inside a software interrupt if one were avaialble or under the the main timers IRQ directly.
The tasklets cannot have its own scheduler, since the MF scheduler already has a while true loop, allthough it can use the HAL completion.
CLR Scheduler knows nothing about tasklets. For all practical purposes tasklets are IRQs. As far time compensation of threads, I think the CLR does compensate for time spent in interrupt handlers (although I am not sure of this. Its been quite a while since I looked at the scheduler code closely).
The MAC was decided to be executed directly under the radio IRQ or virtual timer IRQ, since it might require tight timing constraints to send and receive messages. For example, before every send the channel is sensed for 140 us to make sure it is clear, we dont want things to get interrupted at this time. This will be even more true for the duty cycled MAC which we havent finished implementing yet. Although the message receiving part of the MAC can run in user space, since the radio has already received the complete message at that point.
HALTimerManger having a bug and that causing a bug in mac uninitialize is news to me, although I never really looked at its code closely. Ananth has rewritten the HALTimer and it is now called Virtual Timer, although its possible that this bug is carried over in the new implementation. Mike, can I request you to explain the bug to Ananth, so that this is not carried over.
The radio stopping to send packets after several hours sounds familiar.If I remember right, Nived tried to fix this by uninitializing and reinitializing the radio driver if message sending fails. I think this is part of the sending logic in MAC. Although, it was never clear why this happens in the first place. Nived thought this was related to SPI timing and hence simply tried to reinitialize the radio.
There is more that one while loop in the SPI driver and I was always unhappy with it. I had suggested once before to Nived and Nathan that we rewrite it without the while loops, but I was told that this is the best way to implement the SPI.

Nathan-Stohs commented 10 years ago

(2) I understand and have no problem with (virtual) hardware timers running under interrupt context, this is perfectly logical and what they are for. But my worry is that they are being used off-label for things that really should be synchronous task queues which is apparently what HAL Completions are for.
(4) It sounds like using HAL completions is something we need to stress.
(5) I'm not aware of any mechanism by which the CLR could compensate for time in interrupt handlers, at least not for a hardware interrupt handler. The CLR is potentially completely unaware it happened. TinyCLR jargon also includes "interrupt handler" as a C# callback, these it could potentially account for.
(6) If this all comes out of real timing requirements, then at some level I have nothing to complain about and should just shut-up, but there are some significant consequences to this (see bottom).
(9) The issue with the while loops in SPI is that the MF SPI interface is 100% synchronous. This is not ideal, but to fix it would either mean a significant extension of the interface or to tightly-couple the driver to the hardware. Furthermore, if we are already in interrupt context (as from the timer here) the point is moot. Consider though that SPI is (or should be) running at 4-8 MHz so even if its not ideal the resulting problem is small potatoes, at least in this case.

My main reasons for being bothered by the situation is this:

Large amounts of time spent in the an interrupt handler have a negative impact on system performance from the user perspective.
My biggest problem is that if large chunks of code end up running in interrupts, then we essentially end up with an asynchronous driver and I'm guessing it wasn't actually designed with that in mind. Potentially this is an endless source of headaches.
I absolutely believe you that there are time-critical tasks in the radio driver. My worry is that everything is being treated as a time critical task.
Furthermore I worry that the true time-critical tasks could be better served (see above example). Interrupts should only be doing the last mile stuff, if we are doing things like scheduling or running large chunks of the driver inside the interrupt that sort of defeats the purpose.

AnishAtSamraksh commented 10 years ago

This is worth bothering about: do we have a proposal of the way out?

From: Nathan Stohs [mailto:notifications@github.com] Sent: Monday, July 21, 2014 4:21 PM To: Samraksh/MF Subject: Re: [MF] Radio left on for many hours will fail to send a packet (#163)

I understand and have no problem with (virtual) hardware timers running under interrupt context, this is perfectly logical and what they are for. But my worry is that they are being used off-label for things that really should be synchronous task queues which is apparently what HAL Completions are for.

It sounds like using HAL completions is something we need to stress.

I'm not aware of any mechanism by which the CLR could compensate for time in interrupt handlers, at least not for a hardware interrupt handler. The CLR is potentially completely unaware it happened. TinyCLR jargon also includes "interrupt handler" as a C# callback, these it could potentially account for.

If this all comes out of real timing requirements, then at some level I have nothing to complain about and should just shut-up, but there are some significant consequences to this (see bottom).

The issue with the while loops in SPI is that the MF SPI interface is 100% synchronous. This is not ideal, but to fix it would either mean a significant extension of the interface or to tightly-couple the driver to the hardware. Furthermore, if we are already in interrupt context (as from the timer here) the point is moot. Consider though that SPI is (or should be) running at 4-8 MHz so even if its not ideal it is small potatoes, at least in this case.

My main reasons for being bothered by the situation is this:

Large amounts of time spent in the an interrupt handler have a negative impact on system performance from the user perspective.
My biggest problem is that if large chunks of code end up running in interrupts, then we essentially end up with an asynchronous driver and I'm guessing it wasn't actually designed with that in mind. Potentially this is an endless source of headaches.
I absolutely believe you that there are time-critical tasks in the radio driver. My worry is that everything is being treated as a time critical task.
Furthermore I worry that the true time-critical tasks could be better served (see above example). Interrupts should only be doing the last mile stuff, if we are doing things like scheduling or running large chunks of the driver inside the interrupt that sort of defeats the purpose.

— Reply to this email directly or view it on GitHubhttps://github.com/Samraksh/MF/issues/163#issuecomment-49659793.

Nathan-Stohs commented 10 years ago

Short/Medium term we internalize a design pattern for using HAL Completions or whatever is needed for synchronous task queues. Somebody (Mike, Chris, or myself) should probably create a template or example and we standardize on that.

Long term is a little trickier. We are in maintenance with the radio and MAC driver so I suggest that we just revisit these questions as opportunities and employee bandwidth arise, and keep them on our radar for debugging anything that comes up.

Now that I've dragged us way from the original bug, I should probably attempt to bring us back to the original problem of the radio freezing.

Chris, can you give steps to reproduce the issue and the program you used? If you can give that I will use the JTAG and peek at the SPI traffic with the RF231 which I expect will be illuminating.

ChrisAtSamraksh commented 9 years ago

The radio can send and receive packets all weekend long without crashing as of 233545ba9aed45359279b69390ac9e179b1a0c4b.

540,000 pakcets sent and 370 packet errors.

Samraksh / eMote

Radio left on for many hours will fail to send a packet #163