espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.74k stars 7.3k forks source link

Tasks freezing, wifi disconnection, reconnection impossible (IDFGH-3611) #5551

Open ghost opened 4 years ago

ghost commented 4 years ago

Environment

Problem Description

On the same project as that described here: #5122.

I have an esp32 using wifi and bluetooth:

  • A lot of datas are received via the BLE.
  • These datas are send to a server with mqtt.

After a while, the queue of the btc_thread start to fill up. When the queue is full, the queue of the btu_thread start to fill up...

Then, I have a wifi timeout disconnection (WIFI_REASON_BEACON_TIMEOUT) and it's impossible to reconnect to the wifi (WIFI_REASON_AUTH_EXPIRE).

Some tasks seems to freeze: hciT, btuT, BTC_TASK, sys_evt, esp_timer, ipc0 and ipc1.

log_work_queue_250.log

Note:

Alvin1Zhang commented 4 years ago

Thanks for reporting, we will look into.

liuzfesp commented 4 years ago

Hi @pbruynbroeckQuimesis please help to provide following info:

ghost commented 4 years ago

Hello @liuzfesp,

In this project, I have the periodic_task. This task does different things including:

All theses features works fine in a normal case.

(We see in the log that the perdioc_task is also freeze, that's why we have no more attempt to reconnect to the wifi).

However, I thinks the root cause is linked to the queues used by btc_thread and btu_thread.

What can block the processing of theses queues? What start by fill up the btc queue, and only when the btc queue is full, start to use the btu queue?

Unfortunately, it's a big project and it will take a lot of time to simplify this project and create additional scripts to simulate interactions with other devices. And also, I have the feeling that the issue occurs more often when all functionalities are working in same times (play music, wifi, BLE, mqtt and uart communication).

WCCWCC commented 4 years ago

Hi, @pbruynbroeckQuimesis

gadget-man commented 4 years ago

Looking at the log, this looks very similar to the problem I'm reporting in https://github.com/espressif/esp-idf/issues/5351#issuecomment-657337155. Everything is fine until randomly a wifi:bcn_timout,ap_probe_send_start with 200 beacon timeout, and the device never recovers until a reboot.

ghost commented 4 years ago

Hi, @pbruynbroeckQuimesis

  • Maybe you can increase the queue of btc and btu according to your actual needs.
  • If there are a large number of adv packets in the environment during ble scan, it may cause btc queue to block. You can speed up the execution speed of ble_gap_cb and reduce the time of ble scan.
  • If you still can't solve the problem, I will provide a patch based on your commit, When the problem occurs, the content of btc queue and btu queue will be typed out to see what data causes a sudden increase

Hi @WCCWCC , When the queue btc/btu started to fill up, the ESP32 was working normally. It was connected to my other device via BLE. (The ESP32 wasn't in 'scanning mode').

ghost commented 4 years ago

Thanks @gadget-man, Do you have also the btc and btu queue size increasing just before disconnection?

(To display the size of theses queue, I use osi_thread_queue_wait_size inside btc_transfer_context and btu_task_post

bt_status_t btc_transfer_context(btc_msg_t *msg, void *arg, int arg_len, btc_arg_deep_copy_t copy_func)
{
    static int queue_size = 0;
    static int prev_queue_size = 0;
    queue_size = osi_thread_queue_wait_size(btc_thread, 0);

    if(queue_size > prev_queue_size)
    {
        ESP_LOGI("BTC", "btc_thread waterlevel %d", queue_size);
        prev_queue_size = queue_size;
    }
...

)

gadget-man commented 4 years ago

Thanks @gadget-man, Do you have also the btc and btu queue size increasing just before disconnection?

(To display the size of theses queue, I use osi_thread_queue_wait_size inside btc_transfer_context and btu_task_post

bt_status_t btc_transfer_context(btc_msg_t *msg, void *arg, int arg_len, btc_arg_deep_copy_t copy_func)
{
    static int queue_size = 0;
    static int prev_queue_size = 0;
    queue_size = osi_thread_queue_wait_size(btc_thread, 0);

    if(queue_size > prev_queue_size)
    {
        ESP_LOGI("BTC", "btc_thread waterlevel %d", queue_size);
        prev_queue_size = queue_size;
    }
...

)

I have Bluetooth disabled, is this relevant in my case?

ghost commented 4 years ago

I thought that was because btu / btc queues was full that I have wifi: bcn_timout, ap_probe_send_start but in fact not.. I see in my log:

I (4528853) BTC: btc_thread waterlevel 118
I (4528853) BTC: btc_thread waterlevel 119
I (4528873) wifi:bcn_timout,ap_probe_send_start
I (4528893) BTC: btc_thread waterlevel 120
I (4528933) BTC: btc_thread waterlevel 121
[...]
I (4541963) BTU: btu_thread waterlevel 249
I (4541993) BTU: btu_thread waterlevel 250
I (4542713) wifi:bcn_timout,ap_probe_send_start
I (4545233) wifi:ap_probe_send over, resett wifi status to disassoc
I (4545233) wifi:state: run -> init (c800)
I (4545243) wifi:pm stop, total sleep time: 2863614744 us / 4524970362 us
I (4545243) wifi:new:<1,0>, old:<1,0>, ap:<255,255>, sta:<1,0>, prof:1
I (4545253) WIFI-Manager: SYSTEM_EVENT_STA_DISCONNECTED: 200 (WIFI_REASON_BEACON_TIMEOUT)
heyinling commented 4 years ago

@pbruynbroeckQuimesis BT queue full is not expected to happen. Is this issue only happens when BT queue size full? Can you post the code that lead to BT queue full?

There're some ways to do flow control in application side (you can check the BLE throughput examples) to prevent queue full.

ghost commented 4 years ago

As I said, the ESP works fine: it receives data from BLE and send it over MQTT. (In the same time, it reads data from memory to play music with i2s, it communicates with another device with UART,... so the load is heavy but all works fine.)

But, after a while (here: 4528853 ticks => 75 min), btu/btc queues starts to fill up but without any reason... so I'm not able to find the code that lead to BT queue full.

This is why I asked in https://github.com/espressif/esp-idf/issues/5551#issuecomment-656529024:

What can block the processing of theses queues? What start by fill up the btc queue, and only when the btc queue is full, start to use the btu queue?

monkeytronics commented 1 year ago

Was this ever resolved? I am seeing the exact same behaviour with IDF v4.3.2. Bluetooth off. Nothing intensive going on. Every few weeks, it just disconnects on certain networks with error 200 (WIFI_REASON_BEACON_TIMEOUT) ! These are networks with multiple APs if that's relevant. Also potentially relevant, esp_wifi_set_ps(WIFI_PS_MAX_MODEM)... I noticed that the network was falling out during OTA until I set esp_wifi_set_ps(WIFI_PS_NONE); just before the intensive flash erase etc begun. With this setting it runs fine... So, WIFI_PS_MAX_MODEM is clearly not reliable.

Xiehanxin commented 1 year ago

hi @monkeytronics , esp_wifi_set_ps is used to configure whether to keep the RF on during the interval between two beacon intervals, so if use WIFI_PS_NONE it can keep the RF on all the time and if the wifi environment is bad it can enhance the wifi performance.

AxelLin commented 1 year ago

hi @monkeytronics , esp_wifi_set_ps is used to configure whether to keep the RF on during the interval between two beacon intervals, so if use WIFI_PS_NONE it can keep the RF on all the time and if the wifi environment is bad it can enhance the wifi performance.

@Xiehanxin I don't get your point. The bug report here is not about performance. Since esp-idf supports setting WIFI_PS_MIN_MODEM and WIFI_PS_MAX_MODEM, they should work as well.