espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

GPIO ISR handlers stop working when called simultaneously (IDFGH-4231) #6089

Closed eflukx closed 3 years ago

eflukx commented 3 years ago

Environment

Problem Description

I'm developing an IoT-gateway type device. The current hardware carries two radios (RFM95/RFM69, in any combination) connected through SPI. Both radios can generate a GPIO IRQ. A third GPIO IRQ is used for a simple push-button.

When the radios are of the same type and are set up with exactly the same settings (frequency, data rate, etc.), both radios receive a message at approximately the same time (give or take a few µs/ns). When this happens, the GPIO interrupts freeze up (after maybe 1-3 interrupts/receptions) and no interrupts are handled anymore. The (otherwise completely unrelated) button also stops working; its ISR is no longer called either.

After chasing this bug for some time, I found out the following:

Both radios are handled on a single separate task; the button handling has its own task. The two radio GPIO/IRQ lines are both tied to the same handler, like this:

gpio_isr_handler_add(_pinmap.dio0_0, _sx12xx_irq_handler, (void *)0x11);
gpio_set_intr_type(_pinmap.dio0_0, GPIO_INTR_POSEDGE);

gpio_isr_handler_add(_pinmap.dio0_1, _sx12xx_irq_handler, (void *)0x12);
gpio_set_intr_type(_pinmap.dio0_1, GPIO_INTR_POSEDGE);

The radio IRQ handler function looks something like this:

static void IRAM_ATTR _sx12xx_irq_handler(void *radio_irq_bits)
{
    _radio_ctxs[((uint32_t)radio_irq_bits & 3) - 1].last_irq_ts = esp_timer_get_time();

    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    xEventGroupSetBitsFromISR(_radio_eventgroup, (uint32_t)radio_irq_bits, &xHigherPriorityTaskWoken);

    if (xHigherPriorityTaskWoken)
        portYIELD_FROM_ISR();
}

Essentially, it stores an exact timestamp in the radio's context and sets a bit in the radio event group (on which the handling task is blocking).
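To make the argument encoding explicit: the handler reuses one pointer-sized value both as the event-group bit mask and, through its low two bits, as a 1-based radio selector. Below is a minimal host-side sketch of that decoding. The helper names and RADIO_COUNT are hypothetical and not part of the original code; only the values 0x11 and 0x12 and the `& 3`, `- 1` arithmetic come from the snippets above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: two radios, as in the report. */
#define RADIO_COUNT 2u

/* Hypothetical helper: true if the low two bits select an existing radio
 * (1..RADIO_COUNT). A selector of 0 would underflow in the index math below. */
static inline bool radio_arg_is_valid(const void *arg)
{
    uintptr_t selector = (uintptr_t)arg & 3u;
    return selector >= 1u && selector <= RADIO_COUNT;
}

/* Hypothetical helper: 0-based context index, mirroring
 * `((uint32_t)radio_irq_bits & 3) - 1` from the handler. */
static inline size_t radio_index_from_arg(const void *arg)
{
    return (size_t)(((uintptr_t)arg & 3u) - 1u);
}

/* Hypothetical helper: the same argument, reused unchanged as the bit mask
 * passed to xEventGroupSetBitsFromISR(). */
static inline uint32_t radio_event_bits_from_arg(const void *arg)
{
    return (uint32_t)(uintptr_t)arg;
}
```

With these values, 0x11 maps to context 0 and 0x12 to context 1, and both masks share bit 4. An argument whose low two bits are 0 would wrap the index to a huge unsigned value, so a check like radio_arg_is_valid is cheap insurance if more radios are ever added.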

The button GPIO/IRQ handler is similar, but it pushes the hardware state of the button onto a queue for further processing by the button task (a finite state machine that emits higher-level events on the default event loop, e.g. for pressed, long-pressed, etc.).

static void IRAM_ATTR _gpio_button_irq_handler(void *btn_num)
{
    BaseType_t HigherPriorityTaskWoken = pdFALSE; /* FreeRTOS expects this to start as pdFALSE */
    btn_hal_event_t btn_evt;

    uint8_t num = (int)btn_num;

    btn_evt.state = button_get_state(num);
    btn_evt.ts = esp_timer_get_time();
    btn_evt.num = num;

    xQueueSendFromISR(button_raw_eventqueue, &btn_evt, &HigherPriorityTaskWoken);
}

The _gpio_button_irq_handler likewise does nothing lengthy, nor anything else that (IMO) does not belong in an ISR.

Expected Behavior

Properly working GPIO IRQs, also when two arrive at the same time.

Actual Behavior

(GPIO) interrupt handling seems to stop (deadlock?). No error or warning messages whatsoever appear on the console. The device keeps working; MQTT messages are still processed. The device itself is not hanging or frozen; only the GPIO IRQs are no longer handled.

Steps to reproduce

See above. I will try to reproduce using simpler hardware/software (e.g. a simple switch triggering two IRQs simultaneously).

Alvin1Zhang commented 3 years ago

Thanks for reporting.

eflukx commented 3 years ago

Tested with release v4.3, problem persists.

The issue seems to be similar to this post: "GPIO interrupts lost (hardware race condition)".

eflukx commented 3 years ago

Mostly solved by calling gpio_install_isr_service with the (apparently) correct intr_alloc_flags. Which intr_alloc_flags to use in which case is (IMO) currently very poorly documented.

Calling gpio_install_isr_service(0); instead of gpio_install_isr_service(ESP_INTR_FLAG_LOWMED | ESP_INTR_FLAG_EDGE); seemed to solve the issue. (I also needed to fix another related problem in my codebase. I won't outline it here, as that was purely a local thing.)

For those reading along:

As we're handling multiple edge-triggered interrupts, section '3.14. The ESP32 GPIO peripheral may not trigger interrupts correctly.' of the ESP32 errata applies. Implementing the workaround using level interrupts did not fix my original problem. I have implemented the workaround anyway, as the issue may sporadically pop up (as per the erratum).
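For readers wondering what a level-interrupt workaround can look like, here is a minimal host-side sketch of the bookkeeping. It assumes the common pattern of arming the pin for a high level, treating that interrupt as the rising edge, then re-arming for a low level and ignoring the interrupt that follows. The type and function names are hypothetical, and this is not necessarily the code used in this project. On target, the two re-arm steps would be calls to gpio_set_intr_type() with GPIO_INTR_LOW_LEVEL and GPIO_INTR_HIGH_LEVEL.

```c
#include <stdbool.h>

/* Which level the pin is currently armed for. On target these would
 * correspond to GPIO_INTR_HIGH_LEVEL and GPIO_INTR_LOW_LEVEL. */
typedef enum {
    ARMED_HIGH_LEVEL,
    ARMED_LOW_LEVEL
} armed_level_t;

/* Hypothetical per-pin state for emulating a rising-edge interrupt. */
typedef struct {
    armed_level_t armed;
    unsigned rising_edges; /* number of emulated rising edges seen so far */
} edge_emulator_t;

static inline void edge_emulator_init(edge_emulator_t *e)
{
    e->armed = ARMED_HIGH_LEVEL; /* wait for the line to go high first */
    e->rising_edges = 0;
}

/* Call once per level interrupt. Returns true when the interrupt stands for
 * a rising edge that the application should process. */
static inline bool edge_emulator_on_irq(edge_emulator_t *e)
{
    if (e->armed == ARMED_HIGH_LEVEL) {
        /* Line is high: this is the rising edge. Re-arm for low level so the
         * ISR does not fire continuously while the line stays high.
         * On target: gpio_set_intr_type(pin, GPIO_INTR_LOW_LEVEL); */
        e->armed = ARMED_LOW_LEVEL;
        e->rising_edges++;
        return true;
    }
    /* Line returned low: only re-arm for the next rising edge and ignore.
     * On target: gpio_set_intr_type(pin, GPIO_INTR_HIGH_LEVEL); */
    e->armed = ARMED_HIGH_LEVEL;
    return false;
}
```

Each real ISR would call edge_emulator_on_irq() first and only do its normal work (timestamp, event bits) when it returns true. The cost is one extra, ignored interrupt per pulse.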