Status: closed (nathanjel closed this 7 years ago)
Thanks. I had a look and #233 does not tell me much. I compile in Debug, and I have seen similar issues (different backtrace, but landing on the same line in list.c) for some time now, not only on recent IDF.
Hi @nathanjel ,
Thanks for reporting this. It looks like the same as an issue that we're also tracking internally, we've seen the same crash but only under high network load conditions.
We're continuing to investigate and will update you as soon as we have some more information.
Angus
Exactly - this happens when the device is "bombarded" with messages during load testing (a few-hundred-byte MQTT message every 1 ms). The software itself is generally a port of our original implementation on TI's CC3200, where load testing was never an issue. Results improved after we implemented rate limiting on the ESP32 port (rejecting messages when the messages/sec limit is exceeded), but it never reached the stability of our original implementation - see all my other bug reports.
Did you consider upgrading FreeRTOS from 8.2 to a newer version?
I tried these suggestions - no luck. https://github.com/espressif/esp-idf/blob/master/components/freertos/list.c#L166
It seems to happen more easily if the code is running on core 1, and it happens on both incoming and outgoing packets. Queues are used throughout the system, so I doubt they have anything to do with the problem. I can also tell you that using the raw API and queues does not manifest the problem, so the issue is probably somewhere in netconn and its event logic.
It seems to remain a theme in my app now when under load. I get an issue in a similar place, a few lines later in xTaskRemoveFromEventList, but this time out of normal queue code, while using a compiled state machine:
PC : 0x4008622b PS : 0x00060d33 A0 : 0x80085266 A1 : 0x3ffcf570
A2 : 0x3ffcf778 A3 : 0x3ffcf5f8 A4 : 0x00000008 A5 : 0x3ffcf920
A6 : 0x00000009 A7 : 0x00000000 A8 : 0x00000000 A9 : 0x3ffcf778
A10 : 0x3ffcf778 A11 : 0x00060d23 A12 : 0x00060d23 A13 : 0xffff00ff
A14 : 0xb33f0000 A15 : 0x00060723 SAR : 0x0000001d EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000004 LBEG : 0x4000c2e0 LEND : 0x4000c2f6 LCOUNT : 0xffffffff
Remote debugging using /dev/ttyUSB1
uxListRemove (pxItemToRemove=0x3ffcf778) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:218
218 if( pxList->pxIndex == pxItemToRemove )
(gdb) bt
#0 uxListRemove (pxItemToRemove=0x3ffcf778) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:218
#1 0x40085266 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
#2 0x40083a9a in xQueueGenericSend (xQueue=0x3ffcf8c4, pvItemToQueue=<optimized out>, xTicksToWait=0, xCopyPosition=0)
at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:769
#3 0x400f74bd in InternalSend (queue_addr=0x3ffcf8c4, message=STATUS_MESSAGE_CONFIGURATION_SAVE_REQUEST, supplement=0x0)
at /home/nathan/xnode/SNESP32/main/./status.c:346
#4 0x401072d4 in NetFSM_Msg (this=0x3ffcf680, queue_addr=0x3ffcf8c4, message=STATUS_MESSAGE_CONFIGURATION_SAVE_REQUEST)
at /home/nathan/xnode/SNESP32/components/xnode/./netfsm.c:72
#5 0x40106f6c in NetMap_RemoteBrokerLive_EVENT_TIMER_TICK (fsm=0x3ffcf69c)
at /home/nathan/xnode/SNESP32/components/xnode/./stripnode_net_sm.c:948
#6 0x40128ae0 in stripnode_netContext_EVENT_TIMER_TICK (fsm=0x3ffcf69c)
at /home/nathan/xnode/SNESP32/components/xnode/./stripnode_net_sm.c:1708
#7 0x400f7969 in StatusTask (pvParameters=<optimized out>) at /home/nathan/xnode/SNESP32/main/./status.c:506
(gdb)
Two interesting observations:
There seems to be a correlation between the SPI flash settings and the stability. This could mean a possible hardware issue on my end. Coming down from 80 MHz QIO to 26 MHz QIO did the trick, relieved me of most occurrences of the issue, and improved the observed stability of the network connection under load. DIO at 80 and 40 MHz did not work well. We have a custom PCB, so multiple things could go wrong.
There seems to be some locking issue - using the GDB stub one can notice that there must have been a change in the FreeRTOS task list while this block of code was running.
The code says the list was not empty (queue.c from FreeRTOS):
else
{
if( listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE )
{
if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )
{
/* The task waiting has a higher priority so
And GDB shows that at the time of the dump, the list was empty:
#0 uxListRemove (pxItemToRemove=0x1d) at /home/nathan/xnode/esp-idf/components/freertos/./list.c:214
#1 0x40084cb8 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
#2 0x400838f4 in xQueueGiveFromISR (xQueue=0x3ffd29c0,
pxHigherPriorityTaskWoken=0x3ffb77c4 <g_lwip_critical_section_needs_taskswitch>)
at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:1369
#3 0x400edf35 in sys_sem_signal (sem=0x3ffd29b4) at /home/nathan/xnode/esp-idf/components/lwip/port/freertos/sys_arch.c:136
#4 0x40109230 in event_callback (conn=<optimized out>, evt=<optimized out>, len=<optimized out>)
at /home/nathan/xnode/esp-idf/components/lwip/api/sockets.c:1920
#5 0x4010b2ba in recv_tcp (arg=0x3ffe8a28, pcb=0x3ffe8c38, p=<optimized out>, err=0 '\000')
at /home/nathan/xnode/esp-idf/components/lwip/api/api_msg.c:281
#6 0x400e9f34 in tcp_input (p=<optimized out>, inp=<optimized out>)
at /home/nathan/xnode/esp-idf/components/lwip/core/tcp_in.c:429
#7 0x400e5679 in ip4_input (p=0x3ffe9970, inp=0x3ffbdb50) at /home/nathan/xnode/esp-idf/components/lwip/core/ipv4/ip4.c:678
#8 0x400e53b9 in ethernet_input (p=0x3ffe9970, netif=0x3ffbdb50)
at /home/nathan/xnode/esp-idf/components/lwip/netif/ethernet.c:176
#9 0x400ee2b8 in tcpip_thread (arg=<optimized out>) at /home/nathan/xnode/esp-idf/components/lwip/api/tcpip.c:142
(gdb) up
#1 0x40084cb8 in xTaskRemoveFromEventList (pxEventList=<optimized out>)
at /home/nathan/xnode/esp-idf/components/freertos/./tasks.c:3017
3017 ( void ) uxListRemove( &( pxUnblockedTCB->xEventListItem ) );
(gdb) print pxUnblockedTCB
$1 = (TCB_t *) 0x1
(gdb) print px
pxCurrentTCB pxEventList pxOverflowTimerList pxStack
pxCurrentTimerList pxIndex pxPortInitialiseStack pxTopOfStack
pxDelayedTaskList pxNext pxPrevious pxUnblockedTCB
pxEnd pxOverflowDelayedTaskList pxReadyTasksLists
(gdb) print px
pxCurrentTCB pxEventList pxOverflowTimerList pxStack
pxCurrentTimerList pxIndex pxPortInitialiseStack pxTopOfStack
pxDelayedTaskList pxNext pxPrevious pxUnblockedTCB
pxEnd pxOverflowDelayedTaskList pxReadyTasksLists
(gdb) print pxCurrentT
pxCurrentTCB pxCurrentTimerList
(gdb) print pxCurrentTCB
$2 = {0x3ffbe0b8, 0x3ffbfa60}
(gdb) print pxE
pxEnd pxEventList
(gdb) print pxEventList
$3 = <optimized out>
(gdb) up
#2 0x400838f4 in xQueueGiveFromISR (xQueue=0x3ffd29c0,
pxHigherPriorityTaskWoken=0x3ffb77c4 <g_lwip_critical_section_needs_taskswitch>)
at /home/nathan/xnode/esp-idf/components/freertos/./queue.c:1369
1369 if( xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ) != pdFALSE )
(gdb) print px
pxCurrentTCB pxHigherPriorityTaskWoken pxOverflowTimerList pxQueueSetContainer
pxCurrentTimerList pxIndex pxPortInitialiseStack pxReadyTasksLists
pxDelayedTaskList pxNext pxPrevious
pxEnd pxOverflowDelayedTaskList pxQueue
(gdb) print pxQueue
$4 = (Queue_t * const) 0x3ffd29c0
(gdb) print pxQueue->xTasksWaitingToReceive
$5 = {uxNumberOfItems = 0, pxIndex = 0x3ffd29ec, xListEnd = {xItemValue = 4294967295, pxNext = 0x3ffd29ec,
pxPrevious = 0x3ffd29ec}}
(gdb)
This makes absolute sense once one realizes that locking on &xTaskQueueMutex happens only inside the call to xTaskRemoveFromEventList, while it would make sense to take the lock before we check whether the list is empty at all. I guess the idea could be that &pxQueue->mux is locked out there anyhow, but then again, if a task picks up the event on the other core, it can get off the list between the list-length macro call and the start of the critical section in xTaskRemoveFromEventList. This would also explain why most of the errors like this that I see are on the remove step.
This one is definitely a race condition, related to a locking issue in a multi-processor environment.
In the FreeRTOS tasks.c tick handler xTaskIncrementTick(), a task's xEventListItem can be removed from its containing list, and the only mutex held there is xTaskQueueMutex.
Now, if the xEventListItem is on a specific queue's event list, e.g. pxQueue->xTasksWaitingToReceive, then what can happen is this: in the queue-manipulating functions, e.g. xQueueGenericSend, the check whether the queue's waiting-to-receive list is empty (e.g. listLIST_IS_EMPTY( &( pxQueue->xTasksWaitingToReceive ) ) == pdFALSE) is protected only by the queue-specific mutex, not by xTaskQueueMutex. So between that check and the subsequent call to xTaskRemoveFromEventList( &( pxQueue->xTasksWaitingToReceive ) ), the above-mentioned tick handler can get called on the other core and remove the item from the queue's list, and we get the reported result.
Seems my pull request came in at the same time as this https://github.com/espressif/esp-idf/commit/01c4a980ee05b5eddefd9d01a2bbdecde9944ce7
Anyhow, now the issue does not occur. Thanks.
Thanks for your PR; we have been discussing ways to fix this issue and yours looks like a solid one. We will need to have a more detailed look, but I do see a chance this will be merged as the more reliable solution.
Thumbs up for "more reliable". "Reducing the window where the race condition can happen" does not make me sleep like a baby at night.
Thank you. My initial solution, which I applied internally to be able to continue work, was very similar to the one in 01c4a98, but the code was messier, so I was not offering it to the wider community. The one proposed is elegant, and as the additional list check is done under the protection of xTaskQueueMutex, it should be solid and reliable enough - as far as I can see, the window of opportunity is closed - but I am happy to learn where your concerns come from. I also tested this with my heavy-duty code and the problem has not re-occurred.
Adding xTaskQueueMutex around all the invocations indeed seems more robust, but it is also heavyweight, as the overhead counted in instructions executed is much larger than with the workaround, and it messes more with FreeRTOS internals. I think both are just patches needed to work around fundamentally single-core-driven FreeRTOS design concepts.
In my head I picture a solution that drops the queue-to-task-list coupling completely and replaces it with something else - maybe a lock-free ring buffer that would store queue "events" for consideration by the scheduler...
This issue still exists (SDK v2.1). No heavy load on the network is required - for me it happens when I send a single SSL request just after Wi-Fi reconnects. details
I am getting this issue as well. It started when I created a separate task to send MQTT messages in a for loop. I also have a parallel task which continuously reads from the CAN bus, and two event loops:
@rebeliousconformist Can you please share the version of IDF you use? The use case you're describing sounds like a common scenario, so it should be easily reproducible.
Hi David, I'm using v4.1-dev-1086-g93a8603c5.
Also, with the CAN bus task off, the MQTT client connects to the RabbitMQ server perfectly. When both tasks are running in parallel (MQTT and CAN tx/rx), MQTT sends at most 22 or 23 messages to the server, irrespective of the number of topics.
With the CAN bus readout task switched off, the MQTT event loop works fine.
When MQTT is called from a task, it sends only 22-23 messages. I presume it is getting preempted by either the Wi-Fi event loop or the MQTT event loop. I have to clean up my code before I post anything else.
Also, when the semaphores are removed, the errors go away.
I'll post my code soon.
Hi @rebeliousconformist
Any luck with the test code? Could you please post some bits of it? Thanks!
Hi David, as there was no reply on the same post, I froze the development. I shall post the code today by end of day, 18:00 CET. I will have to shore up the code.
Regards, Nachiket Bapat
Hi Nachiket,
Thanks for your help, and sorry for not responding earlier. I will try to reproduce the issue with your code! Thanks again, David
Hi @david-cermak, the event loop crash error is resolved in the latest developments.
Hi @rebeliousconformist
Thanks for the update! Closing the underlying internal tracker then.
https://github.com/espressif/esp-idf/issues/8336 looks like a similar issue with event groups in esp-idf v4.4 - does vTaskRemoveFromUnorderedEventList() also need a similar fix?
Hi, this still happens in the v4.4.2 release. To trigger it you can use the led_strip example (esp_idf/examples/peripherals/rmt) combined with the connection example. Make a FreeRTOS task that constantly turns the WS2812 RGB LEDs on and off in intervals of 100 ms. While the RGB task is looping, try to connect to Wi-Fi. This error causes two behaviours: the first is the LoadProhibited guru handler; the second is that the program freezes at rmt_write_sample.
Also, when configuring the RMT driver, I use a clock divider of 4 instead of 2 as the API suggests.
In a complex UART/MQTT/WebSockets/SSL scenario, under heavy load (UART data in/out processing every 10 ms, several OS tasks, bombarded with MQTT messages while sending a few MQTT messages per second, using both CPU cores), I randomly run into this repeatable error, which seems to occur as an invalid pointer is dereferenced in FreeRTOS queue-processing code. The 0x1d could hardly be a valid pointer given the ESP32 memory map.
Since down to sys_sem_signal only the semaphore queue object is referenced, and the error occurs further downstream as the semaphore-owning task is extracted and the system-wide event list adjusted, I would suspect the error is occurring somewhere in FreeRTOS internals.
No idea how to fix this so far, and no workarounds.