crownstone / bluenet

Bluenet is the in-house firmware on Crownstone hardware. Functions: switching, dimming, energy monitoring, presence detection, indoor localization, switchcraft.
https://crownstone.rocks
91 stars 62 forks source link

Events are not properly propagated #59

Closed mrquincle closed 5 years ago

mrquincle commented 6 years ago

Regarding fstorage we use the (experimental) file system.

In cs_Storage.cpp we register through fds_register the fds_evt_handler. The fds_evt_handler is defined in cs_Handlers.cpp and forwards the event simply back to the C++ instance, and in particular the function handleFileStorageEvent.

On executing fds_init, we get an FDS_SUCCESS return code, however, we should wait for FDS_EVT_INIT coming through handleFileStorageEvent. It rarely arrives. It's not the case that it NEVER arrives, it just does it SOMETIMES (once in four?). Note that the chip doesn't block. It's just going on without sending these events. What can be the case here?

Also very awkward and seemingly related to those handlers is when the nrf_sdh_soc.c file is enabled in the CMakeLists.txt file. It's surprising that there is no error if it's not by the way. When it is enabled the code seems to hang in an allocation function in the Balloc module. The code in nrf_sdh_soc.c calls: NRF_SDH_STACK_OBSERVER(..., NRF_SDH_SOC_STACK_OBSERVER_PRIO) = .... This seems to feed the stack events to the SOC observers.

The location of the observers are set in the linker.

/arm-none-eabi-nm crownstone.elf | grep sdh_soc_observers
0004b850 t sdh_soc_observers
0004c530 R __start_sdh_soc_observers
0004c540 R __stop_sdh_soc_observers
0004b540 t sdh_stack_observers
0004c550 R __start_sdh_stack_observers
0004c560 R __stop_sdh_stack_observers

We can check this in gdb:

(gdb) p sdh_soc_observers
Reading 12 bytes @ address 0x0004B850
$1 = {
  section = {
    p_start = 0x4c530 <m_crownstone_soc_observer>,
    p_end = 0x4c540 <m_crownstone_state_handler>
  },
  item_size = 8
}
(gdb) p sdh_stack_observers
Reading 12 bytes @ address 0x0004B540
$4 = {
  section = {
    p_start = 0x4c550 <m_nrf_sdh_soc_evts_poll>,
    p_end = 0x4c560 <m_req_obs>
  },
  item_size = 8
}
(gdb) p sdh_state_observers
Reading 12 bytes @ address 0x0004B564
$2 = {
  section = {
    p_start = 0x4c540 <m_crownstone_state_handler>,
    p_end = 0x4c550 <m_nrf_sdh_soc_evts_poll>
  },
  item_size = 8
}
(gdb) p sdh_ble_observers
Reading 12 bytes @ address 0x0004B824
$5 = {
  section = {
    p_start = 0x4c528 <m_stack>,
    p_end = 0x4c530 <m_crownstone_soc_observer>
  },
  item_size = 8
}
(gdb) p sdh_req_observers
Reading 12 bytes @ address 0x0004B4D8
$7 = {
  section = {
    p_start = 0x4c560 <m_req_obs>,
    p_end = 0x4c568 <m_nrf_log_UART_logs_data_const>
  },
  item_size = 8
}

I'm actually not entire sure how m_nrf_log... gets there. Just grabbing on NRF_SDH_REQUEST_OBSERVER does not reveal much. Perhaps it registers using an internal macro.

Observations:

Apart from fds_register we do not call somewhere an fs_sys_event_handler from a sys_evt_dispatch function. First, I assume that the FDS system is able to do this itself for the underlying fstorage system. Second, this has been changed. There is now a NRF_SDH_SOC_OBSERVER macro with which a similar function, in our case crownstone_soc_evt_handler can be used as callback. However, this does not receive SoC events either...

mrquincle commented 5 years ago

It's hard to figure this out for sure. However, it likely is due to the fact that NRF_LOG_DEFERRED was set to 0. This means that everything is processed in place. You can read more on the logger module at http://infocenter.nordicsemi.com/index.jsp?topic=%2Fcom.nordic.infocenter.sdk5.v15.0.0%2Flib_nrf_log.html. What that means is that you do not need to call NRF_LOG_FLUSH manually, which sounds great! However, in that same document performance is listed:

Number of arguments | In-place with RTT | Deferred
-- | -- | --
0 arguments | 12 us | 2.5 us
6 arguments | 300 us | 4.0 us

That is a huge difference between in-place and deferred. By sprinkling a few strategic NRF_LOG_FLUSH statements and switching to NRF_LOG_DEFERRED = 1 we get suddenly all the callbacks and no issues with the code hanging somewhere. My assumption is that somehow the logging module got in a nested critical area from which it couldn't escape properly. Either because flushing was not fast enough or because of some race condition between flushing and entering/exiting the critical area.

vliedel commented 5 years ago

I think this is fixed, seeing how we wait for FDS_EVT_INIT on boot.

mrquincle commented 2 years ago

There's a bug in the bootloader code in the Nordic SDK:

void nrf_bootloader_app_start(void)
{
    uint32_t start_addr = MBR_SIZE; // Always boot from end of MBR. If a SoftDevice is present, it will boot the app.
    NRF_LOG_DEBUG("Running nrf_bootloader_app_start with address: 0x%08x", start_addr);
    uint32_t err_code;

    // Disable and clear interrupts
    // Notice that this disables only 'external' interrupts (positive IRQn).
    NRF_LOG_DEBUG("Disabling interrupts. NVIC->ICER[0]: 0x%x", NVIC->ICER[0]);

    NVIC->ICER[0]=0xFFFFFFFF;
    NVIC->ICPR[0]=0xFFFFFFFF;
#if defined(__NRF_NVIC_ISER_COUNT) && __NRF_NVIC_ISER_COUNT == 2
    NVIC->ICER[1]=0xFFFFFFFF;
    NVIC->ICPR[1]=0xFFFFFFFF;
#endif

    err_code = nrf_dfu_mbr_irq_forward_address_set();
    if (err_code != NRF_SUCCESS)
    {
        NRF_LOG_ERROR("Failed running nrf_dfu_mbr_irq_forward_address_set()");
    }

    NRF_LOG_FLUSH();
    nrf_bootloader_app_start_final(start_addr);
}

See https://devzone.nordicsemi.com/f/nordic-q-a/54373/uart-logging-backend-hangs-in-nrf_bootloader_app_start/220132#220132

The problem is that interrupts are disabled and then NRF_LOG_FLUSH is called. When NRF_LOG_DEFERRED = 0 there's no problem, because then NRF_LOG_FLUSH is just a void statement. When NRF_LOG_DEFERRED = 1 there is indeed a problem. The flush function will wait on an interrupt and these have been disabled! It will therefore hang.

Note that the same thing might happen if you start flushing from a particular interrupt context where interrupts are disabled (or that have a higher interrupt level). You can log from an interrupt (still not recommended), but you should flush from it.