bouffalolab / bl_iot_sdk

BL602/BL702 SDK. Any technical topic, please access the following link.
http://bbs.bouffalolab.com
Apache License 2.0
260 stars 170 forks source link

BL702 BLE: assertion trap in binary library while intensive RX/TX #76

Open mdednev opened 9 months ago

mdednev commented 9 months ago

Hello,

I'm getting errors while trying to make intensive simultaneous data transfer over BLE with BL702, configured as peripheral BLE device:

lld_evt_end:rx_desc_cnt(3) is less than evt->rx_cnt(8),assert!!!
, lld_evt_end
lld_evt_end:num of prog evt (0)!!!
, lld_evt_end
lld_evt_end:evt_mode(4), cur_rx_desc_cnt(8)
, lld_evt_end
ble_assert_param,return address 0x23014e48
, ble_assert_param
ble_assert_param, param0=0x3, param1=0x8, file=/home/yhqin/work/bl_iot_sdk/components/network/ble/blecontroller/ip/ble/ll/src/lld/lld_evt.c, line=3462
, ble_assert_param
ble_assert_param,return address 0x23014c80
, ble_assert_param
ble_assert_param, param0=0x3, param1=0x8, file=/home/yhqin/work/bl_iot_sdk/components/network/ble/blecontroller/ip/ble/ll/src/lld/lld_evt.c, line=3468
, ble_assert_param

I'm using latest m0s1 static library from master branch of bl_iot_sdk repository and RTL8761BU USB Bluetooth adapter under Linux. I've tried std static library and it behaves exactly in the same way. LE MTU is 247 bytes. My Linux test software sends data stream to device (peripheral) at about 8 kB/s speed using GATT write without response command and at the same time device firmware sends another data stream to host (central) at about 8 kB/s using notifications. If data streams are synchronized (i.e. BL702 device waits for data write, receives and sends received data back to host), then I've get such errors (assertions) very rarely. But if data streams are simultaneous and asynchronous, I'm getting assertion after about five seconds of data transfer testing.

My investigation through disassembly and code analysis shows, that at the end of BLE transfer packet sequence in this erroneous condition transfer counter rx_cnt (received packets counter, maintained by software), becomes greater than rx_desc_cnt received packet counter, maintained by BLE hardware controller in EM (Exchange Memory) shared memory area. In the example above software thinks that it received 8 (eleven!) packets, but there were only 3 (three!) packets processed by hardware (reported by hardware RX descriptor counter). This condition triggers the assertion, because hardware should process at least the same packets count as software does. So this shouldn't be but it is here.

It seems, that the main issue is the transfer end timeout "interrrupt" handling, implemented in lld_evt_slot_isr() handler. This functions tries to finish timed out events by calling lld_evt_end_isr(false) in case of passed transfer time:

                    /* BLE_EVENTAPFAINTSTAT_BIT | BLE_EVENTINTSTAT_BIT */
  if ((irq_stat & 0x108) == 0) {
                    /* BLE_RXINTSTAT_BIT */
    g_endisr_miss_cnt = g_endisr_miss_cnt + 1;
    if ((irq_stat & 2) != 0) {
      BLE_INTACK_REG = 2;
    }
    lld_evt_end_isr(false);
    result = 1;
  }

In this case IRQ status is checked against pending end of transfer interrupt flags and if there is no such flags lld_evt_end_isr(false) gets called. But this leads to the potential race condition if in some later moment, when lld_evt_slot_isr() returns one (1) external interrupt flags variable gets updated with newer interrupt status register value, having one of "end of transfer" flags set. So, the same condition will be processed two times instead of exactly one time. It seems for me, that it is an logical error. Interrupt handler lld_evt_end_isr(false) should not be called without actual interrupt status bits were set.

I've tried to override lld_evt_slot_isr(void) with my simple implementation, calling two functions:

extern void rwip_wakeup_end(void);
extern void ea_finetimer_isr(void);

int __wrap_lld_evt_slot_isr(void)
{
    // Handle end of wake-up
    rwip_wakeup_end();

    // Try to schedule immediately
    ea_finetimer_isr();

    return 0;
}

And this solves the data transfer error. I've used ld (linker) parameter --wrap=lld_evt_slot_isr to override original lld_evt_slot_isr() calls in the provided by bl_io_sdk repository static library.

So I definitely sure, that there is an error in software library, that triggers assertion in intensive data transfer applications. Please, explain the implemented logic or change it to prevent possible race conditions and assertion triggering.

Here is BL702 MCU log data, captured before assertion takes places:

[19608.12h.03m.14s.391] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.423] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.439] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.471] BLE_TXRXDESCCNT = 0x0200, g_endisr_miss_cnt=0
[19608.12h.03m.14s.487] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.519] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.535] BLE_TXRXDESCCNT = 0x0101, g_endisr_miss_cnt=0
[19608.12h.03m.14s.567] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.583] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=0
[19608.12h.03m.14s.594] lld_evt_end:rx_desc_cnt(1) is less than evt->rx_cnt(2),assert!!!
[19608.12h.03m.14s.594] , lld_evt_end
[19608.12h.03m.14s.594] lld_evt_end:num of prog evt (0)!!!
[19608.12h.03m.14s.594] , lld_evt_end
[19608.12h.03m.14s.594] lld_evt_end:evt_mode(4), cur_rx_desc_cnt(2)
[19608.12h.03m.14s.594] , lld_evt_end
[19608.12h.03m.14s.594] ble_assert_param,return address 0x230148f2
[19608.12h.03m.14s.594] , ble_assert_param
[19608.12h.03m.14s.594] ble_assert_param, param0=0x1, param1=0x2, file=/home/yhqin/work/bl_iot_sdk/components/network/ble/blecontroller/ip/ble/ll/src/lld/lld_evt.c, line=3462
[19608.12h.03m.14s.594] , ble_assert_param
[19608.12h.03m.14s.594] ble_assert_param,return address 0x2301472a
[19608.12h.03m.14s.594] , ble_assert_param
[19608.12h.03m.14s.594] ble_assert_param, param0=0x1, param1=0x2, file=/home/yhqin/work/bl_iot_sdk/components/network/ble/blecontroller/ip/ble/ll/src/lld/lld_evt.c, line=3468
[19608.12h.03m.14s.610] , ble_assert_param
[19608.12h.03m.14s.610] BLE_TXRXDESCCNT = 0x0200, g_endisr_miss_cnt=1
[19608.12h.03m.14s.626] BLE_TXRXDESCCNT = 0x0201, g_endisr_miss_cnt=3
[19608.12h.03m.14s.658] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.674] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.706] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.722] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.754] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.770] BLE_TXRXDESCCNT = 0x0201, g_endisr_miss_cnt=4
[19608.12h.03m.14s.802] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.818] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.850] BLE_TXRXDESCCNT = 0x0101, g_endisr_miss_cnt=4
[19608.12h.03m.14s.866] BLE_TXRXDESCCNT = 0x0100, g_endisr_miss_cnt=4
[19608.12h.03m.14s.898] BLE_TXRXDESCCNT = 0x0101, g_endisr_miss_cnt=4

As you can see from log data above g_endisr_miss_cnt gets incremented exactly after assertion message. So I'm sure, that such assertions are direct consequences of lld_evt_end_isr(false) uncorrect calls in lld_evt_slot_isr() handler.

I suppose, that BL702 (and may be other MCUs) uses RivieraWaves Bluetooth IP core and driver software and I definitely sure, that such serious bug should be translated to RivieraWaves representatives to find a right way to solve this issue. I've tried various workarounds (f.e. increasing connection interval, lowering MTU size, etc), but nothing helps. The only way to solve this issue at the current moment is using lld_evt_slot_isr() wrapper, described above.

If lld_evt_slot_isr() was implemented by BouffaloLab I kindly ask developers to check it's implementation for BL702 and other MCUs to fix described errors.