examples/gnrc_networking_mac broken on ATmega

benpicco commented 4 years ago

Description

examples/gnrc_networking_mac is broken on (at least) ATmega based MCUs. When a second board comes online it will cause a kernel panic on the first one.

To check if this is a general problem I also flashed the example on a same54-xpro board with an at86rf233 radio module attached. Here the 32bit ARM board would not crash when the 8bit AVR came online, but AVR would crash after a short while when the ARM board would send period messages or was rebooted.

Steps to reproduce the issue

Flash examples/gnrc_networking_mac on an ATmega board, ideally one with an ATmega256RFR2 as that chip provides enough memory and the at86rf2xx radio needed for the example.

Flash a second compatible board with the example.

Expected results

Both boards can communicate.

Actual results

The moment the second board comes online, the ATMega board crashes:

2019-12-02 15:31:45,498 #  IRQ_STATUS 0x81
2019-12-02 15:31:45,499 # IRQ_STATUS1 0x1
2019-12-02 15:31:45,501 # SCIRQS 00
2019-12-02 15:31:45,501 # BATMON 0x22
2019-12-02 15:31:45,502 # EIFR 00
2019-12-02 15:31:45,504 # PCIFR 00
2019-12-02 15:31:45,504 # *** RIOT kernel panic:
2019-12-02 15:31:45,507 # ver on port %u
2019-12-02 15:31:45,507 # 
2019-12-02 15:31:45,507 # 
2019-12-02 15:31:45,515 #   pid | name                 | state    Q | pri | stack  ( used) | base addr  | current     
2019-12-02 15:31:45,524 #     1 | idle                 | running  Q |  15 |    128 (  126) |     0x3c30 |     0x3c54 
2019-12-02 15:31:45,532 #     2 | main                 | bl mutex _ |   7 |    640 (  272) |     0x3cb0 |     0x3e23 
2019-12-02 15:31:45,539 #     3 | pktdump              | bl rx    _ |   6 |    640 (  166) |     0x5bed |     0x5dca 
2019-12-02 15:31:45,547 #     4 | 6lo                  | bl rx    _ |   3 |    512 (  248) |     0x6c24 |     0x6d79 
2019-12-02 15:31:45,556 #     5 | ipv6                 | bl rx    _ |   4 |    512 (  262) |     0x62a9 |     0x63fc 
2019-12-02 15:31:45,564 #     6 | udp                  | bl rx    _ |   5 |    512 (  174) |     0x5e6f |     0x5fc4 
2019-12-02 15:31:45,573 #     7 | at86rf2xx-gomach     | bl rx    _ |   2 |    512 (  394) |     0x6084 |     0x6199 
2019-12-02 15:31:45,581 #     8 | RPL                  | bl rx    _ |   5 |    512 (  102) |     0x6a24 |     0x6bc1 
2019-12-02 15:31:45,588 #       | SUM                  |            |     |   3968 ( 1744)
2019-12-02 15:31:45,588 # 
2019-12-02 15:31:45,588 # *** halted.

Versions

RIOT master.

herjulf commented 4 years ago

Hi, Yes or possible the at86rf233 radio for the embedded AtMegas. The panic status indicate awake Interrupt + PLL Lock Interrupt status are pending for the IRQ_STATUS and transmit start interrupt status for the IRQ_STATUS1. Was trying to add explicit ISR's in at86rf233 for those but w/o result.

miri64 commented 4 years ago

Is this really an issue? The application is only whitelisted for samr21-xpro and iotlab-m3.

herjulf commented 4 years ago

Hello, Yes but whitelisting is not set in stone? IMO besides for very hard technical barriers it also reflects a development status, challenges and work-in-progress. In this particular case RDC combined with deep MCU sleep is proven in other OS:es. RIOT has a good core architecture and can compete and even push the limits.

miri64 commented 4 years ago

Yes but whitelisting is not set in stone?

No, but as the message printed while building "expect errors".

IMO besides for very hard technical barriers it also reflects a development status, challenges and work-in-progress.

It could also mean, that it was never tested for other platforms, so if it is broken, one should not be surprised.

In this particular case RDC combined with deep MCU sleep is proven in other OS:es. RIOT has a good core architecture and can compete and even push the limits.

If this is an underlying problem that is known an issue describing that underlying problem should be opened, rather than describing a symptom in an app that isn't cleared for usage with that architecture. The application can still be used as a use case / test case.

benpicco commented 4 years ago

Those ominous errors should still be documented. If the source of the problem was known already, finding a fix would probably easy. (As is with most bugs)

chudov commented 4 years ago

Seems the problem is not with this application, but with gnrc_gomach: tests/gnrc_gomach crashes on receiver node with exactly the same core panic:

> IRQ_STATUS 0x81
IRQ_STATUS1 0x1
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
cted

Tested with 2 derfmega256 modules.

benpicco commented 4 years ago

I haven't tried this yet, but maybe the solution is as simple as CFLAGS += -DTHREAD_STACKSIZE_MAIN=1024?

That helped getting sock_dns working.

chudov commented 4 years ago

Tested on derfmega256 board. Doesn't help:

IRQ_STATUS 0x81
IRQ_STATUS1 0x1
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
init_queue() !!!!

pid | name                 | state    Q | pri | stack  ( used) ( free) | base addr  | current     
  1 | idle                 | running  Q |  15 |    128 (  126) (    2) |     0x25f0 |     0x2617 
  2 | main                 | bl mutex _ |   7 |   2048 (  276) ( 1772) |     0x2670 |     0x2d5e 
  3 | pktdump              | bl rx    _ |   6 |   2048 (  160) ( 1888) |     0x2ee4 |     0x3646 
  4 | at86rf2xx-gomach     | bl rx    _ |   2 |    512 (  370) (  142) |     0x36f9 |     0x3813 
    | SUM                  |            |     |   4736 (  932) ( 3804)

*** halted.

chudov commented 3 years ago

Does anybody know how ISR shall be properly handled for AVR8? Documentation mentions "an interrupt service routine (ISR) that handles the interrupt is executed in another context." What context is valid for AVR?

Is it written anywhere? Also I would like to know why in the original rtt.c for atmega_common (before #12852) the callback is reset to NULL?

After several trials with derfmega256 and JTAG debugging it looks like context switch in avr8_exit_isr() leads to restoration of completely wrong program counter in idle task. That in turn leads to kernel panic. Most probably the RTT based on symbol counter does something wrong.

I'm not sure how to move forward with debugging of this issue.

@benpicco or @herjulf can you help?

benpicco commented 3 years ago

@maribu are you familiar with the AVR interrupt / context switch code? My first guess would be an interrupt stack overflow, but I can't find a ISR_STACKSIZE define for AVR.

chudov commented 3 years ago

Seems there is no interrupt stack in AVR...

maribu commented 3 years ago

If I recall correctly, on AVR an IRQ victimizes the stack of whatever thread is currently running - in most cases the idle stack.

benpicco commented 3 years ago

Hm so maybe CFLAGS += -DTHREAD_STACKSIZE_IDLE=256 will do the trick?

maribu commented 3 years ago

Yes. But long term it will be better to also introduce an ISR stack, otherwise every thread needs to have at least ISR_STACKSIZE of extra RAM consumption to work reliable, which is likely not a good idea.

chudov commented 3 years ago

I don't believe the stack size is a problem as I have tested with 2048 stack:

IRQ_STATUS 0x81
IRQ_STATUS1 00
SCIRQS 00
BATMON 0x22
EIFR 00
PCIFR 00
*** RIOT kernel panic:
R %#02x
PCIFR %#02x

pid | name                 | state    Q | pri | stack  ( used) ( free) | base addr  | current     
  1 | idle                 | running  Q |  15 |   2048 (  152) ( 1896) |     0x27d9 |     0x2f80 
  2 | main                 | bl mutex _ |   7 |   1024 (  388) (  636) |     0x2fd9 |     0x32d3 
  3 | pktdump              | bl rx    _ |   6 |   1024 (  162) (  862) |     0x3fd9 |     0x4339 
  4 | at86rf2xx-gomach     | bl rx    _ |   2 |   1024 (  368) (  656) |     0x344c |     0x3764 
    | SUM                  |            |     |   5120 ( 1070) ( 4050)
*** halted.

benpicco commented 3 years ago

If you are suspecting the symbol counter RTT, have you tried with CFLAGS += -DRTT_BACKEND_SC=0?

Judging by the broken format string, we end up here this time.

chudov commented 3 years ago

Result is the same. Actually both RTT are based on the same code, so this is expected. The question is how to try to debug it and how to fix it. Seems the problem is inside context switching for AVR. The issue happens when more than one iterrupt occured. I think it is actually not needed to use full-blown gomach to reproduce the issus. Are there any tests for context switching, may be with interrupts? Are there any written descriptions of how context switch shall work and how interrupt handling shall be done?

maribu commented 3 years ago

I have a branch for a test with context switching from ISR, as I expected an issue there. But it passed and the issue at hand was a stack overflow instead. I can PR the test.

chudov commented 3 years ago

Seems I've caught it and is not related to RTT, but to the 802.15.4 transciever. There is no handler defined for TRX24_RX_START interrupt. So at soon as the transciever start receiving, the BADISR_vect is called that leads to kernel panic. But then I have a question: how rfr2 transciever is able to work in other test applications???

benpicco commented 3 years ago

Good find! The RX start interrupt is normally disabled and can be enabled by setting the NETOPT_RX_START_IRQ option. This is only done by gomach, lwmac and openwsn, so it has not been caught so far.

chudov commented 3 years ago

Fix looks easy, so will see.

benpicco commented 3 years ago

What boggles my mind is that this option is even supported by the driver, with the SPI based AT86RF2xx modules the RX_START event can only be configured to generate an interrupt on a separate pin (DIG2) which is connected to PB17 on samr21-xpro.

This is entirely unused and there is no call to netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED) to be found in the driver.

So how did gnrc_networking_mac ever work on samr21-xpro if this event is never generated?

(btw.: @jia200x can event_callback() be called from interrupt context now?)

maribu commented 3 years ago

@jia200x can event_callback() be called from interrupt context now?

With the PR that will make use of confirm_send() it will be possible for drivers that have not yet been ported tonthe 802.15.4 radio HAL, which is still the case for the AT86RF* drivers to the best of my knowledge.

jia200x commented 3 years ago

So how did gnrc_networking_mac ever work on samr21-xpro if this event is never generated?

I've found these patterns in gnrc_mac :/. And some others like preloading frames without disabling CSMA-CA. We should give it a look.

(btw.: @jia200x can event_callback() be called from interrupt context now?)

Yes, it can! It uses the same principle as confirm_send.

benpicco commented 3 years ago

That's great news, that means we can get rid of that 'IRQ flag emulation' (dev->irq_status) for the ATmegaRF and it also means it's trivial to add the RX start IRQ to the SPI variant. (Just another GPIO interrupt that calls netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED))

jia200x commented 3 years ago

That's great news, that means we can get rid of that 'IRQ flag emulation' (dev->irq_status) for the ATmegaRF and it also means it's trivial to add the RX start IRQ to the SPI variant. (Just another GPIO interrupt that calls netdev->event_callback(netdev, NETDEV_EVENT_RX_STARTED))

In addition to that, it would be more than ideal if we try to go towards asynchronous SPI for resolving the other events. It would be ideal if all event_callback calls get resolved on ISR. Besides saving quite some RAM in some cases (and making it more deterministic), this would simplify a lot the network stack integration

benpicco commented 3 years ago

I got confused, there is already a check for AT86RF2XX_IRQ_STATUS_MASK__RX_START - so the simple fix would be adding

ISR(TRX24_RX_START_vect)
{
    avr8_enter_isr();

    ((at86rf2xx_t *)at86rfmega_dev)->irq_status |= AT86RF2XX_IRQ_STATUS_MASK__RX_START;

    /* Call upper layer to process received data */
    netdev_trigger_event_isr(at86rfmega_dev);

    avr8_exit_isr();
}

or just

ISR(TRX24_RX_START_vect)
{
    avr8_enter_isr();

    at86rfmega_dev->event_callback(netdev, NETDEV_EVENT_RX_STARTED);

    avr8_exit_isr();
}

chudov commented 3 years ago

I got confused, there is already a check for AT86RF2XX_IRQ_STATUS_MASK__RX_START - so the simple fix would be adding

Please check proposed fix (#16038). I've added handlers for several additional interrupts.

herjulf commented 3 years ago

Thanks for this! Aha so the problem a missing ISR. Good work. Just verified on avr-rss2 board with gnrc_gomach.

RIOT-OS / RIOT