Closed benpicco closed 3 years ago
Hi, Yes, or possibly the at86rf233 radio for the embedded ATmegas. The panic status indicates that the awake interrupt and the PLL lock interrupt are pending in IRQ_STATUS, and the transmit start interrupt in IRQ_STATUS1. I was trying to add explicit ISRs in at86rf233 for those, but without result.
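For readers following the register dumps in this thread, here is a minimal host-side sketch of how an IRQ_STATUS value such as 0x81 breaks down into flags. The bit names are an assumption taken from the ATmega256RFR2 datasheet's IRQ_STATUS layout, and `irq_status_decode()` is a hypothetical helper, not part of RIOT.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit names (index = bit position) assumed from the ATmega256RFR2
 * datasheet; verify against your datasheet revision. */
static const char *const irq_status_names[8] = {
    "PLL_LOCK", "PLL_UNLOCK", "RX_START", "RX_END",
    "CCA_ED_DONE", "AMI", "TX_END", "AWAKE",
};

/* Hypothetical helper: writes a '|'-separated list of the flags set in
 * `status` into `buf` and returns the number of flags that are set. */
static unsigned irq_status_decode(uint8_t status, char *buf, size_t len)
{
    unsigned count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (unsigned bit = 0; bit < 8; bit++) {
        if (status & (1u << bit)) {
            if (count > 0) {
                strncat(buf, "|", len - strlen(buf) - 1);
            }
            strncat(buf, irq_status_names[bit], len - strlen(buf) - 1);
            count++;
        }
    }
    return count;
}
```

With these assumed names, 0x81 decodes to PLL_LOCK and AWAKE, which matches the reading given above.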
Is this really an issue? The application is only whitelisted for samr21-xpro and iotlab-m3.
Hello, Yes but whitelisting is not set in stone? IMO besides for very hard technical barriers it also reflects a development status, challenges and work-in-progress. In this particular case RDC combined with deep MCU sleep is proven in other OS:es. RIOT has a good core architecture and can compete and even push the limits.
Yes but whitelisting is not set in stone?
No, but as the message printed while building says: "expect errors".
IMO besides for very hard technical barriers it also reflects a development status, challenges and work-in-progress.
It could also mean, that it was never tested for other platforms, so if it is broken, one should not be surprised.
In this particular case RDC combined with deep MCU sleep is proven in other OS:es. RIOT has a good core architecture and can compete and even push the limits.
If this is an underlying problem that is known, an issue describing that underlying problem should be opened, rather than one describing a symptom in an app that isn't cleared for use with that architecture. The application can still be used as a use case / test case.
Those ominous errors should still be documented. If the source of the problem were known already, finding a fix would probably be easy. (As is the case with most bugs.)
Seems the problem is not with this application, but with gnrc_gomach: tests/gnrc_gomach crashes on receiver node with exactly the same core panic:
> IRQ_STATUS 0x81
IRQ_STATUS1 0x1
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
cted
Tested with 2 derfmega256 modules.
I haven't tried this yet, but maybe the solution is as simple as CFLAGS += -DTHREAD_STACKSIZE_MAIN=1024?
That helped getting sock_dns working.
Tested on derfmega256 board. Doesn't help:
IRQ_STATUS 0x81
IRQ_STATUS1 0x1
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
init_queue() !!!!
pid | name | state Q | pri | stack ( used) ( free) | base addr | current
1 | idle | running Q | 15 | 128 ( 126) ( 2) | 0x25f0 | 0x2617
2 | main | bl mutex _ | 7 | 2048 ( 276) ( 1772) | 0x2670 | 0x2d5e
3 | pktdump | bl rx _ | 6 | 2048 ( 160) ( 1888) | 0x2ee4 | 0x3646
4 | at86rf2xx-gomach | bl rx _ | 2 | 512 ( 370) ( 142) | 0x36f9 | 0x3813
| SUM | | | 4736 ( 932) ( 3804)
*** halted.
Does anybody know how ISR shall be properly handled for AVR8? Documentation mentions "an interrupt service routine (ISR) that handles the interrupt is executed in another context." What context is valid for AVR?
Is it written anywhere? Also I would like to know why in the original rtt.c for atmega_common (before #12852) the callback is reset to NULL?
After several trials with derfmega256 and JTAG debugging, it looks like the context switch in avr8_exit_isr() leads to the restoration of a completely wrong program counter in the idle task. That in turn leads to the kernel panic. Most probably the RTT based on the symbol counter does something wrong.
I'm not sure how to move forward with debugging of this issue.
@benpicco or @herjulf can you help?
@maribu are you familiar with the AVR interrupt / context switch code?
My first guess would be an interrupt stack overflow, but I can't find an ISR_STACKSIZE define for AVR.
Seems there is no interrupt stack in AVR...
If I recall correctly, on AVR an IRQ victimizes the stack of whatever thread is currently running - in most cases the idle stack.
Hm, so maybe CFLAGS += -DTHREAD_STACKSIZE_IDLE=256 will do the trick?
Yes. But long term it will be better to also introduce an ISR stack; otherwise every thread needs at least ISR_STACKSIZE of extra RAM to work reliably, which is likely not a good idea.
I don't believe the stack size is a problem as I have tested with 2048 stack:
IRQ_STATUS 0x81
IRQ_STATUS1 00
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
R %#02x
PCIFR %#02x
pid | name | state Q | pri | stack ( used) ( free) | base addr | current
1 | idle | running Q | 15 | 2048 ( 152) ( 1896) | 0x27d9 | 0x2f80
2 | main | bl mutex _ | 7 | 1024 ( 388) ( 636) | 0x2fd9 | 0x32d3
3 | pktdump | bl rx _ | 6 | 1024 ( 162) ( 862) | 0x3fd9 | 0x4339
4 | at86rf2xx-gomach | bl rx _ | 2 | 1024 ( 368) ( 656) | 0x344c | 0x3764
| SUM | | | 5120 ( 1070) ( 4050)
*** halted.
If you are suspecting the symbol counter RTT, have you tried with CFLAGS += -DRTT_BACKEND_SC=0?
Judging by the broken format string, we end up here this time.
The result is the same. Actually both RTTs are based on the same code, so this is expected. The question is how to debug it and how to fix it. It seems the problem is inside the context switching for AVR. The issue happens when more than one interrupt has occurred. I think a full-blown gomach is not actually needed to reproduce the issue. Are there any tests for context switching, maybe with interrupts? Are there any written descriptions of how the context switch shall work and how interrupt handling shall be done?
I have a branch for a test with context switching from ISR, as I expected an issue there. But it passed and the issue at hand was a stack overflow instead. I can PR the test.
Seems I've caught it, and it is not related to the RTT but to the 802.15.4 transceiver. There is no handler defined for the TRX24_RX_START interrupt. So as soon as the transceiver starts receiving, BADISR_vect is called, which leads to the kernel panic. But then I have a question: how is the rfr2 transceiver able to work in other test applications???
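The failure mode described above can be modelled on a host machine. This is only a sketch: the vector numbers, the table, and the `vector_install()` / `vector_fire()` helpers are hypothetical stand-ins for the AVR vector table, and `bad_isr()` stands in for the BADISR_vect catch-all handler that ends in the kernel panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vector numbers, valid for this host-side model only. */
enum { VEC_RX_START, VEC_RX_END, VEC_TX_END, VEC_COUNT };

typedef void (*isr_t)(void);

static bool panicked;      /* set when the catch-all handler runs */
static unsigned rx_starts; /* number of handled RX start interrupts */

/* Models BADISR_vect: every vector without its own handler lands here. */
static void bad_isr(void)
{
    panicked = true;
}

/* Models a dedicated handler for the RX start interrupt. */
static void rx_start_isr(void)
{
    rx_starts++;
}

/* All vectors start out pointing at the catch-all handler. */
static isr_t vectors[VEC_COUNT] = { bad_isr, bad_isr, bad_isr };

/* Install a dedicated handler for one vector. */
static void vector_install(unsigned vec, isr_t handler)
{
    assert(vec < VEC_COUNT && handler != NULL);
    vectors[vec] = handler;
}

/* Models the hardware raising an enabled interrupt. */
static void vector_fire(unsigned vec)
{
    assert(vec < VEC_COUNT);
    vectors[vec]();
}
```

In this model, firing `VEC_RX_START` before a handler is installed sets `panicked`; after `vector_install(VEC_RX_START, rx_start_isr)` the same interrupt is simply counted. This mirrors why the bug only shows up once something enables the RX start interrupt.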
Good find!
The RX start interrupt is normally disabled and can be enabled by setting the NETOPT_RX_START_IRQ option.
This is only done by gomach, lwmac and openwsn, so it has not been caught so far.
Fix looks easy, so will see.
What boggles my mind is that this option is even supported by the driver. With the SPI based AT86RF2xx modules the RX_START event can only be configured to generate an interrupt on a separate pin (DIG2), which is connected to PB17 on samr21-xpro.
This is entirely unused and there is no call to netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED) to be found in the driver.
So how did gnrc_networking_mac ever work on samr21-xpro if this event is never generated?
(btw.: @jia200x can event_callback() be called from interrupt context now?)
@jia200x can event_callback() be called from interrupt context now?
With the PR that will make use of confirm_send() it will be possible for drivers that have not yet been ported to the 802.15.4 radio HAL, which is still the case for the AT86RF* drivers to the best of my knowledge.
So how did gnrc_networking_mac ever work on samr21-xpro if this event is never generated?
I've found these patterns in gnrc_mac :/, and some others like preloading frames without disabling CSMA-CA. We should give it a look.
(btw.: @jia200x can event_callback() be called from interrupt context now?)
Yes, it can! It uses the same principle as confirm_send.
That's great news, that means we can get rid of that 'IRQ flag emulation' (dev->irq_status) for the ATmegaRF and it also means it's trivial to add the RX start IRQ to the SPI variant. (Just another GPIO interrupt that calls netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED))
That's great news, that means we can get rid of that 'IRQ flag emulation' (dev->irq_status) for the ATmegaRF and it also means it's trivial to add the RX start IRQ to the SPI variant. (Just another GPIO interrupt that calls netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED))
In addition to that, it would be ideal to move towards asynchronous SPI for resolving the other events, so that all event_callback calls get resolved in the ISR. Besides saving quite some RAM in some cases (and making it more deterministic), this would greatly simplify the network stack integration.
I got confused, there is already a check for AT86RF2XX_IRQ_STATUS_MASK__RX_START - so the simple fix would be adding
ISR(TRX24_RX_START_vect)
{
    avr8_enter_isr();
    ((at86rf2xx_t *)at86rfmega_dev)->irq_status |= AT86RF2XX_IRQ_STATUS_MASK__RX_START;
    /* Call upper layer to process received data */
    netdev_trigger_event_isr(at86rfmega_dev);
    avr8_exit_isr();
}
or just
ISR(TRX24_RX_START_vect)
{
    avr8_enter_isr();
    /* Report the event directly from interrupt context */
    at86rfmega_dev->event_callback(at86rfmega_dev, NETDEV_EVENT_RX_STARTED);
    avr8_exit_isr();
}
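To make the difference between the two variants concrete, here is a minimal host-side model of the first one, the 'IRQ flag emulation' pattern: the ISR only records a flag and requests deferred processing, and the driver later turns the recorded flags into events in thread context. All names (`model_dev_t`, `model_isr()`, `model_handle()`) and the bit positions are hypothetical and only loosely mirror the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, for this model only. */
#define IRQ_MASK_RX_START (1u << 2)
#define IRQ_MASK_RX_END   (1u << 3)

typedef enum { EVENT_RX_STARTED, EVENT_RX_COMPLETE, EVENT_COUNT } event_t;

typedef struct {
    volatile uint8_t irq_status;  /* emulated IRQ flags, set by the "ISR" */
    unsigned isr_pending;         /* deferred-processing requests */
    unsigned events[EVENT_COUNT]; /* events delivered to the upper layer */
} model_dev_t;

/* ISR side: record the flag and request deferred processing, nothing else. */
static void model_isr(model_dev_t *dev, uint8_t mask)
{
    assert(dev != NULL);
    dev->irq_status |= mask;
    dev->isr_pending++;
}

/* Thread side: fetch and clear the recorded flags, then raise the events.
 * On real hardware the fetch-and-clear would need to be atomic with
 * respect to the ISR. */
static void model_handle(model_dev_t *dev)
{
    assert(dev != NULL);
    uint8_t status = dev->irq_status;
    dev->irq_status = 0;
    dev->isr_pending = 0;

    if (status & IRQ_MASK_RX_START) {
        dev->events[EVENT_RX_STARTED]++;
    }
    if (status & IRQ_MASK_RX_END) {
        dev->events[EVENT_RX_COMPLETE]++;
    }
}
```

The second variant skips the flag and the deferred step entirely by calling the event callback from the ISR, which is only safe if the callback may run in interrupt context.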
I got confused, there is already a check for AT86RF2XX_IRQ_STATUS_MASK__RX_START - so the simple fix would be adding
Please check proposed fix (#16038). I've added handlers for several additional interrupts.
Thanks for this! Aha, so the problem was a missing ISR. Good work. Just verified on the avr-rss2 board with gnrc_gomach.
Description
examples/gnrc_networking_mac is broken on (at least) ATmega based MCUs. When a second board comes online it will cause a kernel panic on the first one.
To check if this is a general problem I also flashed the example on a same54-xpro board with an at86rf233 radio module attached. Here the 32-bit ARM board would not crash when the 8-bit AVR came online, but the AVR would crash after a short while when the ARM board sent periodic messages or was rebooted.
Steps to reproduce the issue
Flash examples/gnrc_networking_mac on an ATmega board, ideally one with an ATmega256RFR2, as that chip provides enough memory and the at86rf2xx radio needed for the example.
Flash a second compatible board with the example.
Expected results
Both boards can communicate.
Actual results
The moment the second board comes online, the ATmega board crashes:
Versions
RIOT master.