Nimble panics and reboots when nimble_port_stop is called because of too short timeout.

jhnlmn commented 2 years ago

https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/porting/npl/freertos/src/npl_os_freertos.c#L393

https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/nimble/host/src/ble_hs_stop.c this calls ble_npl_callout_reset(&ble_hs_stop_terminate_tmo, ble_npl_time_ms_to_ticks32(BLE_HOST_STOP_TIMEOUT_MS) which converts BLE_HOST_STOP_TIMEOUT_MS (which is 2000) to 200 if CONFIG_FREERTOS_HZ=100 But https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/porting/npl/freertos/src/npl_os_freertos.c calls esp_timer_start_once(co->handle, ticks*1000), so 200 ticks is converted to 200,000 microseconds, which is too short - very often connections are not closed within 200 ms and then I get panic: assert failed: spinlock_acquire spinlock.h:123 ((result == SPINLOCK_FREE) == (lock->count == 0))

So, there are 2 issues here:

Simple, replace ticks1000 by ble_npl_time_ticks_to_ms32(ticks)1000LL
but what if 2 second timeout is not enough? Would it panic again? Is it possible to cleanup connections cleanly, without panics or crashes, regardless whether timeout happened or not?

PS: I found that the problem is worse: some callers ble_npl_callout_reset are passing values in ticks, for example in https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/nimble/transport/socket/src/ble_hci_socket.c but other places, such as: https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/nimble/controller/src/ble_ll_conn.c use timeout from BLE_LL_CONN_AUTH_PYLD_OS_TMO, which is: BLE_LL_CONN_AUTH_PYLD_OS_TMO(x) ble_npl_time_ms_to_ticks32((x) * 10) so, it multiplies ticks by 10. So, somebody should verify all calls to ble_npl_callout_reset and make sure they are in the same units.

ESPAbhinav commented 1 year ago

Hi @jhnlmn,

https://github.com/espressif/esp-nimble/blob/1dc1ec6e76b0ab3bf93cc9f1ff7a2a09141e7c61/nimble/host/src/ble_hs_stop.c this calls ble_npl_callout_reset(&ble_hs_stop_terminate_tmo, ble_npl_time_ms_to_ticks32(BLE_HOST_STOP_TIMEOUT_MS) which converts BLE_HOST_STOP_TIMEOUT_MS (which is 2000) to 200 if CONFIG_FREERTOS_HZ=100 and CONFIG_BT_NIMBLE_USE_ESP_TIMER flag is disabled.

By default, the CONFIG_BT_NIMBLE_USE_ESP_TIMER flag is enabled and thus ble_npl_time_ms_to_ticks32(BLE_HOST_STOP_TIMEOUT_MS) converts 2000 ms to 2000 ticks which is converted to 2000000 microseconds (2 seconds) which is sufficient timeout.

Also replacing (ticks 1000) by ble_npl_time_ticks_to_ms32(ticks) 1000LL will be useful if and only if the CONFIG_BT_NIMBLE_USE_ESP_TIMER flag is disabled but if the CONFIG_BT_NIMBLE_USE_ESP_TIMER is disabled then return esp_err_to_npl_error(esp_timer_start_once(callout->handle, ticks*1000)) will not be executed. Hence replacement is not useful in the current situation where is flag is already enabled.

All the calls to npl_freertos_callout_reset(struct ble_npl_callout *co, ble_npl_time_t ticks) inside the nimble host have received parameters in the same units via ble_npl_time_ms_to_ticks32() or ble_npl_time_ms_to_ticks().

Please note, that ESP-IDF takes the host part of the upstream nimble. Hence, all instances you pointed out in ll files are controller-related and not maintained by espressif. So, I suggest you can open a ticket on upstream mynewt-nimble for this.

Thanks, Abhinav

rahult-github commented 4 months ago

Closing this. Please feel free to reopen in case of any host related issue.

espressif / esp-nimble

Nimble panics and reboots when nimble_port_stop is called because of too short timeout. #41