Closed ChrisAtSamraksh closed 9 years ago
Chris, would leaving a mote on overnight be sufficient to bring about this issue?
And the master does receive messages even after being left on, right? So this is an issue that seems to be specific to the radio and TX operations, not a more general failure?
Leaving a mote on overnight would not (as far as I know) cause problems. I am betting this is a radio-only problem because the code is definitely still running (GPIO reading and LCD display are still running in the managed code). It needs to be more fully characterized.
Possibly related: I just had the mote get stuck in this endless while loop on line 307 of DeviceCode/Targets/Native/STM32F10x/DeviceCode/drivers/spi/netmf_spi.cpp in function CPU_SPI_ReadWriteByte(...).
while (SPI_I2S_GetFlagStatus(SPI_mod, SPI_I2S_FLAG_RXNE) == RESET);
If we're advertising "Real-Time C#", shouldn't we develop some utility functions to make it easier to enforce timeouts into continuations while waiting for peripheral hardware signals? I realize there's a difference between "real-time" and "fault-masking" but even during execution under non-fault conditions, this and other status flag checks are dependent on hardware properties and are therefore unbounded.
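A minimal sketch of the kind of utility being asked for here, assuming a simple poll-count budget rather than a real clock. `WaitForFlag` is a hypothetical name, not an existing MF or STM32 library function; only `SPI_I2S_GetFlagStatus`, `SPI_mod`, `SPI_I2S_FLAG_RXNE` and `RESET` come from the line quoted above.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the MF HAL): polls `ready` until it
// returns true or until `max_polls` attempts have been made.
// Returns true if the condition was seen, false if the budget ran out.
template <typename Predicate>
bool WaitForFlag(Predicate ready, uint32_t max_polls) {
    for (uint32_t i = 0; i < max_polls; ++i) {
        if (ready()) {
            return true;
        }
    }
    return false;  // timed out; the caller decides how to recover
}

// Possible use in place of the unbounded loop (the budget of 10000 is an
// arbitrary placeholder, not a measured value):
//
//   if (!WaitForFlag([&] {
//           return SPI_I2S_GetFlagStatus(SPI_mod, SPI_I2S_FLAG_RXNE) != RESET;
//       }, 10000)) {
//       /* report the SPI fault instead of hanging */
//   }
```

A poll count only bounds the wait loosely, since the time per iteration depends on the CPU clock; a version that compares against a hardware tick counter would give a tighter bound.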
Slightly related question... why is the Radio Driver running inside an IRQ handler...?
A timer was set up to send out a radio Beacon message every five seconds (for example). When the timer fires it places the callback (beaconScheduler()) in a tasklet which eventually calls the callback.
I didn't know it was called from an IRQ, I assumed it was part of the scheduler.
This is not just radio code doing this, but anything that uses a timer to call a callback function would behave in this manner.
"but anything that uses a timer to call a callback function would behave in this manner."
No, that is not how continuations and completions (including Timer callbacks) work.
The sequence is
If the tasklet system changes are moving user timer events to interrupt handlers, I would strongly argue that architecture is fundamentally broken.
Or maybe I'm overblowing the issue here, but in this particular case it is way too much to be stuffed into an interrupt.
So if user C# timer callbacks are running like this, that would be very bad. That is originally what I thought but now I think that is not what Chris actually meant.
So the issue then is virtual timer callbacks for other HAL devices (that use the virtual timer?), in this particular case the radio driver. This is less bad but still broken.
Tasklets were written to be similar to Linux where they have two parts, a High and Low. The High part runs in interrupt context (for time sensitive code only), and the Low part is put on a task queue / deferred processing and run in System context.
What it sounds like is that we only have a High part (interrupt context)***. So by default any timed event runs inside an interrupt. In the case of the radio, this appears to mean that practically speaking the entire driver is running inside an interrupt. That is a recipe for disaster.
So what we need is a Low part implementation, a way to stick a task on a queue rather than running it inside the interrupt. I'm not familiar with how Tasklets were envisioned on MF to begin with. Do we have provisions for the Low half? Does the new VirtualTimer have such an ability?
***(I see code that says Tasklet_Low but it also appears to be an interrupt).
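To illustrate the split described above, here is a rough sketch of a deferred-work queue: the interrupt side only records the work, and the system-context side runs it later. Every name here (`DeferredQueue`, `Post`, `RunPending`) is hypothetical and not an MF, Tasklet, or VirtualTimer API. It assumes a single interrupt producer and a single main-loop consumer; a real port would probably also need interrupt masking or memory barriers around the index updates.

```cpp
#include <cstddef>

// Hypothetical work item: a plain function pointer plus an argument.
using WorkFn = void (*)(void* arg);

struct WorkItem {
    WorkFn fn;
    void* arg;
};

class DeferredQueue {
public:
    // Intended for interrupt context: records the work, never runs it.
    // Returns false if the queue is full so the caller can count drops.
    bool Post(WorkFn fn, void* arg) {
        std::size_t next = (head_ + 1) % kCapacity;
        if (next == tail_) {
            return false;
        }
        items_[head_] = WorkItem{fn, arg};
        head_ = next;
        return true;
    }

    // Intended for the main loop (system context): runs everything that
    // is pending and returns how many items were executed.
    std::size_t RunPending() {
        std::size_t ran = 0;
        while (tail_ != head_) {
            WorkItem item = items_[tail_];
            tail_ = (tail_ + 1) % kCapacity;
            item.fn(item.arg);
            ++ran;
        }
        return ran;
    }

private:
    static constexpr std::size_t kCapacity = 8;  // holds kCapacity - 1 items
    WorkItem items_[kCapacity] = {};
    volatile std::size_t head_ = 0;  // advanced only by Post()
    volatile std::size_t tail_ = 0;  // advanced only by RunPending()
};
```

With something like this, a timer interrupt would call `Post(...)` with the beacon callback and return immediately, and the long call chain into the radio driver would run from `RunPending()` outside interrupt context.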
I agree that this looks too busy for an IRQ handler. I interpret "way too much stuff" to mean a call chain 12 functions deep with a while loop that can stall waiting on a hardware event.
I agree that the design is broken if one schedules a continuation to avoid spending too much time in an IRQ handler, only to have the continuation do the same thing from another IRQ handler.
My radio problem above is most likely unrelated to the overnight radio death. In porting code from the old MAC API to the new MAC API, nowhere was it documented that one should no longer call CPU_Radio_TurnOn() because csmaMAC.cpp now makes that call (and other problems like that). The main problem was that UnInitialize() functions did not do their job properly at multiple locations. So when I called gcsmaMacObject.UnInitialize() to install an update, CSMAMAC never removed the three HALTimers it set up. And when I implemented csmaMAC::UnInitialize() by making it call gHalTimerManager.StopTimer(...), I found out that causes HALTimerCallback(...) to call a null pointer. And after protecting the null pointer, I found out that HALTimerCallback(...) still does not work because it continues iterating over the maximum number of timers ever allocated so there's a logic bug where its while loop is never satisfied.
So please watch out for DeInit() functions that are not counterparts to their Init() functions.
And please use a singleton pattern with a reference counter if an initialization should only happen once. ... related: Will somebody answer the following question if they already know (so I don't have to look the data sheet up)? I assume that SPI hardware may be initialized multiple times without doing any harm, right? Because the SD, tinyhal, RF231, LPC2*XX, and Microsoft_SPOT_Hardware_SPI C# all call CPU_SPI_Initialize() at different times and CPU_SPI_Initialize() does not keep track of whether it's already initialized.
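One possible shape for the reference-counted initialization suggested above, as a sketch only. `RefCountedInit` and its methods are hypothetical names; the `init` and `deinit` function pointers stand in for whatever real driver calls would be wrapped, and nothing here reflects how `CPU_SPI_Initialize()` actually behaves.

```cpp
#include <cstdint>

// Hypothetical guard: runs `init` for the first user only and `deinit`
// for the last user only, so Init/DeInit stay true counterparts.
class RefCountedInit {
public:
    using Fn = void (*)();

    RefCountedInit(Fn init, Fn deinit) : init_(init), deinit_(deinit) {}

    // First caller triggers the real initialization; later callers
    // only bump the count.
    void Acquire() {
        if (count_++ == 0) {
            init_();
        }
    }

    // Last caller triggers the real teardown. Returns false on an
    // unmatched Release() so the bug is visible rather than silent.
    bool Release() {
        if (count_ == 0) {
            return false;
        }
        if (--count_ == 0) {
            deinit_();
        }
        return true;
    }

    uint32_t users() const { return count_; }

private:
    Fn init_;
    Fn deinit_;
    uint32_t count_ = 0;
};
```

This version is not interrupt-safe; if `Acquire()` or `Release()` could be called from both an interrupt and the main loop, the count updates would need to be protected.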
Please, will somebody point me to a proper comparison of Tasklets and Continuations? Are Tasklets integrated with the MF Scheduler such that time spent in a Tasklet is paid back to the thread that was supposed to be executing?
Several issues here. Let me clarify what I know.
My main reasons for being bothered by the situation are these:
This is worth bothering about: do we have a proposal of the way out?
From: Nathan Stohs [mailto:notifications@github.com] Sent: Monday, July 21, 2014 4:21 PM To: Samraksh/MF Subject: Re: [MF] Radio left on for many hours will fail to send a packet (#163)
I understand and have no problem with (virtual) hardware timers running under interrupt context, this is perfectly logical and what they are for. But my worry is that they are being used off-label for things that really should be synchronous task queues which is apparently what HAL Completions are for.
It sounds like using HAL completions is something we need to stress.
I'm not aware of any mechanism by which the CLR could compensate for time in interrupt handlers, at least not for a hardware interrupt handler. The CLR is potentially completely unaware it happened. TinyCLR jargon also uses "interrupt handler" for a C# callback; those it could potentially account for.
If this all comes out of real timing requirements, then at some level I have nothing to complain about and should just shut up, but there are some significant consequences to this (see bottom).
The issue with the while loops in SPI is that the MF SPI interface is 100% synchronous. This is not ideal, but fixing it would mean either a significant extension of the interface or tightly coupling the driver to the hardware. Furthermore, if we are already in interrupt context (as from the timer here) the point is moot. Consider though that SPI is (or should be) running at 4-8 MHz, so even if it's not ideal it is small potatoes, at least in this case.
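As a back-of-the-envelope check on the "small potatoes" estimate, the wire time for one byte at those clock rates can be worked out directly. This assumes 8 clock cycles per byte and ignores software overhead and gaps between bytes; `TransferTimeNs` is a hypothetical helper written for this sketch.

```cpp
#include <cstdint>

// Nanoseconds needed to clock `bits` bits at `clock_hz`, ignoring any
// software overhead or inter-byte gaps.
constexpr uint64_t TransferTimeNs(uint64_t bits, uint64_t clock_hz) {
    return bits * 1000000000ULL / clock_hz;
}

// One byte takes 2 us at 4 MHz and 1 us at 8 MHz.
static_assert(TransferTimeNs(8, 4000000) == 2000, "2 us per byte at 4 MHz");
static_assert(TransferTimeNs(8, 8000000) == 1000, "1 us per byte at 8 MHz");
```

That figure only covers the healthy case; if the status flag never sets, the wait is still unbounded.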
Short/Medium term we internalize a design pattern for using HAL Completions or whatever is needed for synchronous task queues. Somebody (Mike, Chris, or myself) should probably create a template or example and we standardize on that.
Long term is a little trickier. We are in maintenance with the radio and MAC driver so I suggest that we just revisit these questions as opportunities and employee bandwidth arise, and keep them on our radar for debugging anything that comes up.
Now that I've dragged us away from the original bug, I should probably attempt to bring us back to the original problem of the radio freezing.
Chris, can you give steps to reproduce the issue and the program you used? If you can give that I will use the JTAG and peek at the SPI traffic with the RF231 which I expect will be illuminating.
The radio can send and receive packets all weekend long without crashing as of 233545ba9aed45359279b69390ac9e179b1a0c4b.
540,000 packets sent and 370 packet errors.
With the demo I have been testing how long we can have our software running with no problems. The code still runs because if the button to our demo is pushed the LCD indicates it. However, a newly rebooted master does not get radio messages from one of the button slaves that has been left on for a long time. Upon rebooting, the slave will send a message.