espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.04k stars 7.13k forks source link

host_recv_pkt_cb is called from interrupt context causing a crash. (IDFGH-5265) #7031

Open dr-gino opened 3 years ago

dr-gino commented 3 years ago

Environment

Problem Description

https://github.com/bluekitchen/btstack/issues/357 When using the Bluekitchen BTStack as the Bluetooth stack host_recv_pkt_cb is called from interrupt context causing a crash. According to the author of BTStack this is an issue with ESP IDF and they suggested raising an issue here.

Expected Behavior

No crash.

Actual Behavior

abort() was called at PC 0x40082a4a on core 0

Steps to reproduce

Install BTStack and run the example code below. Open Bluetooth settings on iPhone and wait for ESP32 to request bonding, once accepted the ESP32 crashes.

Code to reproduce this issue

#include <stdio.h>
#include <stdlib.h>

#include "btstack_port_esp32.h"
#include "btstack_run_loop.h"
#include "btstack_run_loop_freertos.h"

#include "btstack_config.h"
#include "btstack.h"

static btstack_packet_callback_registration_t hci_event_reg_;
// Change address below to remote device (iPhone in my case)
static uint8_t remote_device_[] = {0x64, 0x70, 0x33, 0x1C, 0xAD, 0x2D};

static void handle_sdp_client_query_result(uint8_t packet_type, uint16_t channel, uint8_t *packet, uint16_t size)
{
    // Empty, not needed for example.
}

void hci_packet_handler(uint8_t packet_type, uint16_t channel, uint8_t *packet, uint16_t size)
{
    switch (hci_event_packet_get_type(packet)) {    
    case BTSTACK_EVENT_STATE:
        if (btstack_event_state_get_state(packet) == HCI_STATE_WORKING) {
            printf("Starting dedicated bonding...\r\n");
            gap_dedicated_bonding(remote_device_, 1);
        }
        break;
    case GAP_EVENT_DEDICATED_BONDING_COMPLETED:
        if (gap_event_dedicated_bonding_completed_get_status(packet) == 0) {
            printf("Bonding completed, querying device.\r\n");
            sdp_client_query_uuid16(&handle_sdp_client_query_result, remote_device_, BLUETOOTH_SERVICE_CLASS_NAP);
        }
        break;
    default:
        break;
    }
}

static void task_btstack(void *param)
{
    btstack_init();
    l2cap_init();
    sdp_init();

    hci_event_reg_.callback = &hci_packet_handler;
    hci_add_event_handler(&hci_event_reg_);

    gap_ssp_set_io_capability(SSP_IO_CAPABILITY_NO_INPUT_NO_OUTPUT);
    gap_set_local_name("testdev");
    gap_set_class_of_device(0x200418); // Headphones CoD (Class of Device)

    hci_power_control(HCI_POWER_ON);
    btstack_run_loop_execute();
}

void app_main(void)
{
    task_btstack(NULL);
}

Debug Logs

Serial output only shows:

abort() was called at PC 0x40082a4a on core 0

Backtrace:0x4008e6df:0x3ffbdfd0 0x4008ed7d:0x3ffbdff0 0x400927fe:0x3ffbe010 0x40082a4a:0x3ffbe080 0x40082b6d:0x3ffbe0b0 0x400ee012:0x3ffbe0d0 0x400ee075:0x3ffbe110 0x400d958e:0x3ffbe130 0x400d959f:0x3ffbe150 0x4010c25a:0x3ffbe180 0x400fddce:0x3ffbe1a0 0x4001791d:0x3ffbe1c0 0x400fe535:0x3ffbe1e0 0x400fe612:0x3ffbe220 0x400188f5:0x3ffbe250 
0x400fdfc7:0x3ffbe290 0x400175ee:0x3ffbe2b0 0x40017688:0x3ffbe2d0 0x4008dc1a:0x3ffbe300 0x4008ddf9:0x3ffbe320 0x400fdd7b:0x3ffbe340 0x4008de10:0x3ffbe360 0x400824b1:0x3ffbe380 0x40086e0e:0x3ffbc210 0x4008dbc1:0x3ffbc230 0x4010c3f9:0x3ffbc250 0x400fb30c:0x3ffbc270 0x400d9419:0x3ffbc290 0x400d3b05:0x3ffbc2b0 0x400d3c27:0x3ffbc2d0 0x400d77fb:0x3ffbc2f0 0x400d7938:0x3ffbc350 0x400d840f:0x3ffbc390 0x400d90e9:0x3ffbc3d0 0x400d1af5:0x3ffbc400 0x400d1b21:0x3ffbc420 0x400d1423:0x3ffbc440 0x400d2fb5:0x3ffbc460 0x400d327f:0x3ffbc490 0x400d62f3:0x3ffbc4c0 0x400d6842:0x3ffbc510 0x400d9524:0x3ffbc530 0x400d9575:0x3ffbc560 0x400d9c91:0x3ffbc580 0x400d28d2:0x3ffbc5b0 0x400d1455:0x3ffbc5d0 0x400d1461:0x3ffbc5f0 0x400dc9c6:0x3ffbc610 0x4008ed85:0x3ffbc640

Using EspStackTraceDecoder.jar I was able to recover the following trace:

0x4008e6df: panic_abort at C:\Users\Gino\.platformio\packages\framework-espidf\components\esp_system/panic.c:330
0x4008ed7d: esp_system_abort at C:\Users\Gino\.platformio\packages\framework-espidf\components\esp_system/system_api.c:106
0x400927fe: abort at C:\Users\Gino\.platformio\packages\framework-espidf\components\newlib/abort.c:46
0x40082a4a: lock_acquire_generic at C:\Users\Gino\.platformio\packages\framework-espidf\components\newlib/locks.c:142
0x40082b6d: _lock_acquire_recursive at C:\Users\Gino\.platformio\packages\framework-espidf\components\newlib/locks.c:170
0x400ee012: _puts_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/puts.c:89 (discriminator 2)
0x400ee075: puts at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32-elf/src/newlib/newlib/libc/stdio/puts.c:129
0x400d958e: report_recv_called_from_isr at C:\Users\Gino\Desktop\bt/components\btstack/btstack_port_esp32.c:104
0x400d959f: host_recv_pkt_cb at C:\Users\Gino\Desktop\bt/components\btstack/btstack_port_esp32.c:128
0x4010c25a: vhci_send at ??:?
0x400fddce: r_eif_send at ??:?
0x4001791d: ?? ??:0
0x400fe535: hci_tx_start at hci_tl.c:?
0x400fe612: r_hci_tl_send at ??:?
0x400188f5: ?? ??:0
0x400fdfc7: r_hci_send_2_host_hack at ??:?
0x400175ee: ?? ??:0
0x40017688: ?? ??:0
0x4008dc1a: vhci_recv_end at vhci.c:?
0x4008ddf9: r_vhci_isr at ??:?
0x400fdd7b: r_eif_isr at ??:?
0x4008de10: r_eif_isr_wrapper at intc.c:?
0x400824b1: _xt_lowint1 at C:\Users\Gino\.platformio\packages\framework-espidf\components\freertos\xtensa/xtensa_vectors.S:1105
0x40086e0e: cause_sw_intr_to_core_wrapper at C:/Users/Gino/.platformio/packages/framework-espidf/components/bt/controller/bt.c:769
0x4008dbc1: vhci_set_interrupt at vhci.c:?
0x4010c3f9: API_vhci_host_send_packet at ??:?
0x400fb30c: esp_vhci_host_send_packet at C:/Users/Gino/.platformio/packages/framework-espidf/components/bt/controller/bt.c:1008
0x400d9419: transport_send_packet at C:\Users\Gino\Desktop\bt/components\btstack/btstack_port_esp32.c:287
0x400d3b05: hci_send_acl_packet_fragments at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:723
0x400d3c27: hci_send_acl_packet_buffer at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:785
0x400d77fb: l2cap_send_signaling_packet at C:\Users\Gino\Desktop\bt/components\btstack\src/l2cap.c:1287
0x400d7938: l2cap_run_for_classic_channel at C:\Users\Gino\Desktop\bt/components\btstack\src/l2cap.c:1517
0x400d840f: l2cap_run at C:\Users\Gino\Desktop\bt/components\btstack\src/l2cap.c:1874
0x400d90e9: l2cap_create_channel at C:\Users\Gino\Desktop\bt/components\btstack\src/l2cap.c:2060
0x400d1af5: sdp_client_query at C:\Users\Gino\Desktop\bt/components\btstack\src\classic/sdp_client.c:756
0x400d1b21: sdp_client_query_uuid16 at C:\Users\Gino\Desktop\bt/components\btstack\src\classic/sdp_client.c:761
0x400d1423: hci_packet_handler at C:\Users\Gino\Desktop\bt/src/main.c:31
0x400d2fb5: hci_emit_event at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:4832
0x400d327f: hci_emit_dedicated_bonding_result at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:5042
0x400d62f3: event_handler at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:2720
0x400d6842: packet_handler at C:\Users\Gino\Desktop\bt/components\btstack\src/hci.c:2985
0x400d9524: transport_deliver_packets at C:\Users\Gino\Desktop\bt/components\btstack/btstack_port_esp32.c:180
0x400d9575: transport_process at C:\Users\Gino\Desktop\bt/components\btstack/btstack_port_esp32.c:196
0x400d9c91: btstack_run_loop_freertos_execute at C:\Users\Gino\Desktop\bt/components\btstack\platform\freertos/btstack_run_loop_freertos.c:231
0x400d28d2: btstack_run_loop_execute at C:\Users\Gino\Desktop\bt/components\btstack\src/btstack_run_loop.c:166
0x400d1455: task_btstack at C:\Users\Gino\Desktop\bt/src/main.c:54
0x400d1461: app_main at C:\Users\Gino\Desktop\bt/src/main.c:60
0x400dc9c6: main_task at C:\Users\Gino\.platformio\packages\framework-espidf\components\esp32/cpu_start.c:600 (discriminator 2)
0x4008ed85: vPortTaskWrapper at C:\Users\Gino\.platformio\packages\framework-espidf\components\freertos\xtensa/port.c:143
xiongweichao commented 3 years ago

Hi, @dr-gino

Crash occurs when printf is called in an interrupt. You can replace printf with ets_printf in report_recv_called_from_isr .

Thanks.

dr-gino commented 3 years ago

Hi, @dr-gino

Crash occurs when printf is called in an interrupt. You can replace printf with ets_printf in report_recv_called_from_isr .

Thanks.

Hi @xiongweichao,

You were right, there was a call to printf inside report_recv_called_from_isr commenting it out or replacing it with ets_printf prevents it from crashing but as you can probably tell from the name of the function report_recv_called_from_isr the developer of BTstack did not expect the HCI packet callback to be called in that context.

Is BTstack's author incorrect in assuming that the HCI packet handler callback registered using esp_vhci_host_register_callback is never to be called from within an ISR context?

mringwal commented 3 years ago

Hi @xiongweichao

we've got more data on this. Please have a look at esp32-hw-error-on-acl-packet-without-connection.pklg.zip

In the log, the host stack sends an ACL packet after it receives the HCI Disconnect event. In this situation, the stack should not send the packet. We will fix it in the stack.

However, the same problem can also occur in other situations. Eg.. assume the ESP32 is streaming audio packets to a remote device. When the remote device sends a HCI Disconnect (without first closing AVDTP L2CAP connection), the host stack might queue an AVDTP media packet just before the HCI Disconnect is received. As HCI is asynchronous by design, this cannot be avoided. Sending an Hardware Error seems wrong to me, it would be better to just drop the packet for the (now) invalid HCI connection.

What do you think?

Cheers Matthias

mringwal commented 3 years ago

Any news here? The issue about sending an ACL packet with a now invalid HCI Connection handle is independent from the Bluetooth stack and could happen with NimBLE or Bluedroid as well.

BetterJincheng commented 2 years ago

@mringwal Sorry for such a delayed reply. The scenario you describe is bound to happen. Thanks a lot for your suggestion. Now I share with you our handling of this situation. Receiving packets from an invalid HCI connection causes "OUT OF SYNC". In a certain controller status, esp32 will response with "HW ERROR". According to Bluetooth Core Spec, "hw error event" is implementation-specific. We will evaluate your suggestion, but I can not guarantee the changes to the code. Thanks again!

mringwal commented 2 years ago

@BetterJincheng Thanks for looking at this again. I don't think that emitting a HW Error in this situation because of a race condition is a good user experience :)

If the Bluetooth Controller lib is able to respond with an HW Error Event, it should even be easier to just drop the packet.

BetterJincheng commented 2 years ago

@mringwal We have some more information about this issue.

Due to wrong packet sent from HOST (BTStack) to CTRL, CTRL turned into state of "out of sync" and responsed with an EVT of "HW Error" in an ISR context.

In other word, the host_recv_pkt_cb called from ISR is to notify HOST that downstream packet was wrong (HW Error).

So, the root cause of this issue is the HOST has sent incorrect packet downward.

The information below is helpful to this issue. Please help on it. Was esp_vhci_host_send_packet called only in one task context?

Thanks!

mringwal commented 2 years ago

@BetterJincheng Thanks for your reply.

A few different topics:

  1. call to host_recv_pkt_cb from ISR:I am still convinced that host_recv_pkt_cb should never be called from ISR. As Espressif has the source code for the Bluetooth Controller, it would be great if all calls to host_recv_pkt_cb could be checked and made sure it's never called from IR.

  2. ACL packet with invalid con handle: Even if the host stack sends an ACL packet with an invalid connection handle. In that case, the Bluetooth Controller could just drop the packet.

I don't have a trace for this, but what about my example from May 2021: e.g. music streaming app continuously sends audio data. At one point, the remote side closes the connection and the Bluetooth Controller emits a HCI Disconnect event. In our implementation, this is queued in a ring buffer and processed the next time the Bluetooth Host thread gets executed. It's unclear to me if this can be avoided by any Bluetooth stack 100%. Here, this would be no issue if the Bluetooth Controller just silently drops the packet.

  1. esp_vhci_host_send_packet: esp_vhci_host_send_packet is only called from our main Bluetooth thread.

Thanks

BetterJincheng commented 1 year ago

@mringwal excerpted from Core Specification. FYI~

If the UART synchronization is lost in the communication from Host to Controller, then the Controller shall send an HCI_Hardware_Error event to tell the Host about the synchronization error. 

mringwal commented 1 year ago

@BetterJincheng That's correct. We can handle the HCI Hardware Error as expected (assume Controller has crashed, reset and start over). However, there's no reason that this event is delivered from ISR context.

To clarify, I think a callback should always get called back from the same type of context: either it's always called from a thread context, or, it's always called from ISR context. Otherwise, it's quite challenging to deal with. In any case, it would need to be properly documented.

So, any progress on the two issues:

  1. host_recv_pkt_cb called from ISR - this should not happen and all errors be delivered from the regular Controller thread.
  2. outgoing ACL packets with invalid con handle should get dropped instead of causing a Controller crash/hardware error. See reasoning on async communication above. This is not a UART synchronization error btw.

Thanks!

BetterJincheng commented 1 year ago

@mringwal

  1. We will move the TX of "HW_ERR_EVT" into thread context.
  2. When we received an invalid con handle, we cannot be sure that the package is correct or not: 2.1 Host sent an invalid con handle, Controller received the same con handle; 2.2 Host sent a valid con handle, Controller errorly received an invalid con handle. So in the view of Controller, it is a UART synchronization error.
mringwal commented 1 year ago

Hi @BetterJincheng

Thanks for moving HCI Hardware Error into thread context.

I do agree that an invalid con handle might indicate a UART synchronization error.

However, please consider this scenario:

As this is a valid scanario that cannot be avoided, I think it's better to drop the packet instead of sending HW Error.

A compromise could be to remember which connection was active for e.g. 100 ms after it was disconnected. With that, you could distinguish a packet with a invalid / never-used connection handle from one that has just been disconnected.

In any case, it's good if each new connection gets a new connection handle (wraps at 4096). I don't know if that's the case, if not, that would be good to add.

What do you think?

Cheers, Matthias

BetterJincheng commented 1 year ago

@mringwal

  1. To remember connection status will introduce a timer which increased system complexity.
  2. If Controller just dropped the packets of an invalid con handle. There is something can happen: If Host received the disconnection event, and it still send packets on the disconnected con handle. In this scenario, dropping the packets is not a good choice.
mringwal commented 1 year ago

Hi again. I do agree, there should not be an additional timer.

What about this:

Do you agree that the Controller should not send an HW error in a situation where neither side sent incorrect data? If the Host ist sending incorrect data, there will be other errors, too

BetterJincheng commented 2 months ago

@mringwal ACL data on disconnected handle will be ignore by the Contrller which is accessed by 2fa475bc .

Some information needs to be noted: If the flow control from Host to Controller is enabled, the Host expects to receive an event of NUMBER_OF_COMPLETED_PACKETS when the controller has finished processing a packet. image When the Controller has already sent the event of DISCONNECTION_COMPLETE to the Host, it will no longer report NUMBER_OF_COMPLETED_PACKETS upwards. This means that, The host is responsible for managing flow control related information, such as, credit.

mringwal commented 2 months ago

ACL data on disconnected handle will be ignore by the Contrller which is accessed by 2fa475bc .

That's great!

Thanks for reminding on the Host to Host Controller flow control. After receiving a HCI Disconnected Event, we free the HCI Connection structure which includes the counter for the number of sent packets - so that's synchronized then as well.