espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.02k stars 7.12k forks source link

httpd_queue_work drops work when called in quick succession (IDFGH-6813) #8440

Closed Barabas5532 closed 2 years ago

Barabas5532 commented 2 years ago

Environment

Problem Description

When httpd_queue_work is called in a tight loop, some of the work never gets executed. There are no warnings printed, the work function just never runs.

Expected Behavior

Every call to httpd_queue_work causes a call to the work function.

Actual Behavior

Sometimes the work function doesn't run after a call to httpd_queue_work.

Steps to reproduce

  1. Flash my minimal reproduction code to a devkit
  2. Set up Wi-Fi credentials: https://github.com/espressif/esp-idf/blob/master/examples/protocols/README.md#establishing-wi-fi-or-ethernet-connection
  3. observe that the work function is called less times than expected.

I've added an assert that should fail when working as expected. This does fail if a 100 ms delay is introduced into the loop that calls httpd_queue_work.

Code to reproduce this issue

https://github.com/ShrapnelDSP/ShrapnelMonorepo/tree/httpd-work-queue-leak-bug/bug-demo

Debug Logs

// Without added delay, work only runs 8 times
I (5545) main: work 1
I (5549) main: work 2
I (5552) main: work 3
I (5558) main: work 4
I (5570) main: work 5
I (5588) main: work 6
I (5597) main: work 7
I (5600) main: work 8
// With 100ms delay in loop, working as expected
I (5138) main: work 1
I (5237) main: work 2
I (5337) main: work 3
I (5437) main: work 4
I (5538) main: work 5
I (5638) main: work 6
I (5737) main: work 7
I (5837) main: work 8
I (5938) main: work 9
I (6037) main: work 10
I (6137) main: work 11
I (6237) main: work 12
I (6338) main: work 13
I (6437) main: work 14
I (6538) main: work 15
I (6637) main: work 16
I (6738) main: work 17
I (6837) main: work 18
I (6937) main: work 19
I (7038) main: work 20

ssert failed: void work(void*) main.cpp:158 ((!(i == 20)) && "working as expected")

Other items if possible

hmalpani commented 2 years ago

Hello @Barabas5532 Thanks for reporting the issue. We are trying to solve this issue. Until then can you please provide some information on why you are calling this function in quick succession?

Barabas5532 commented 2 years ago

My project is a DSP audio processor for guitar. There are DSP algorithms running in the ESP32, and each of these has a number of parameters. The frontend is connected to the ESP32 using websockets. The frontend initialises itself by requesting the value of all the audio parameters. In response, the ESP32 sends a websocket frame for each parameter. This causes many websocket messages to be queued into the httpd work queue at the same time.

I have found a workaround by limiting the amount of queued work to a single item (https://github.com/ShrapnelDSP/ShrapnelMonorepo/commit/7139c99115cad40148c5c22d1a5aef08ffac776b), I would like to be able to remove this.

The project is open source, the relevant parts are:

hmalpani commented 2 years ago

Hello @Barabas5532, Please try the attached patch and let me know if it fixes your issue. I have created a new API httpd_queue_work_sync(). Try using it instead of httpd_queue_work() 0001-esp_http_server-Add-httpd_queue_work_sync-API-for-sy.zip

This issue can also be solved by increasing the size of the UDP mailbox in menuconfig>component_config>LWIP>UDP.

Thanks, Harshit

martinKupec commented 4 weeks ago

Hi, I were also hit by this issue, but I cannot serialize the queue with semaphores as I need it asynchronous. I have found the root problem. It is shortage of MEMP_NETBUF buffers in LWIP. It fails in LWIP on this line: https://github.com/espressif/esp-lwip/blob/a45be9e438f6cf9c54ec150581819c3b95d5af6b/src/api/api_msg.c#L183

This can be solved by rising MEMP_NUM_NETBUF to some higher value (each buffer is about 36 bytes of size).

It also seems that lowering LWIP_LOOPBACK_MAX_PBUFS will at least notify the sender, that the work will never be done.

It is my understanding, that a proper fix would be to stop using socket as messaging layer and use some thread safe queue directly as all but one mallocs happen in the middle layers and are fundamentally not needed.