FreeRTOS / iot-reference-esp32

MIT License
77 stars 47 forks source link

[BUG] Random LoadProhibited when publishing command callback #50

Closed law-ko closed 8 months ago

law-ko commented 1 year ago

The crash will occur after around 20000000 ticks, and happens when trying to send a new publish to AWS IoT. The program is set to publish to AWS IoT every 30 seconds.

The error occurs when performing xTaskNotify, which is coming from the example function provided in this repo.

System information

Expected behavior A clear description of the expected behavior.

Screenshots or console output

I (21102850) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (21102850) coreMQTT: State record updated. New state=MQTTPublishDone.
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x400964de  PS      : 0x00060f33  A0      : 0x800e1b51  A1      : 0x3ffe7170  
0x400964de: xTaskGenericNotify at /Users/test/esp/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:5718

A2      : 0x0000cdcd  A3      : 0x3ffbdc38  A4      : 0x80096c4c  A5      : 0x00000003  
A6      : 0x00000000  A7      : 0x00000031  A8      : 0x800964ce  A9      : 0x0000cecd  
A10     : 0x00000001  A11     : 0x3ffc3638  A12     : 0x00060f20  A13     : 0x00060f23  
A14     : 0x0000000f  A15     : 0x0000cdcd  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x0000cf21  LBEG    : 0x400014fd  LEND    : 0x4000150d  LCOUNT  : 0xfffffffc  

Backtrace: 0x400964db:0x3ffe7170 0x400e1b4e:0x3ffe7190 0x400e7919:0x3ffe71b0 0x400e7b50:0x3ffe71f0 0x400e9a11:0x3ffe7210 0x400e9b51:0x3ffe7290 0x400e7e69:0x3ffe72b0 0x400dbea8:0x3ffe7300 0x40096c49:0x3ffe7320
0x400964db: xTaskGenericNotify at /Users/test/esp/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:5718

0x400e1b4e: prvPublishCommandCallback at /Users/test/esp-git/esp-matter/examples/im/outlet/main/aws_iot/fleet_prov_by_claim_demo.c:380

0x400e7919: concludeCommand at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:808

0x400e7b50: handleAcks at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:638
 (inlined by) mqttEventCallback at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:700

0x400e9a11: handlePublishAcks at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT/coreMQTT/source/core_mqtt.c:1555
 (inlined by) handleIncomingAck at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT/coreMQTT/source/core_mqtt.c:1601
 (inlined by) receiveSingleIteration at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT/coreMQTT/source/core_mqtt.c:1756

0x400e9b51: MQTT_ProcessLoop at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT/coreMQTT/source/core_mqtt.c:3065

0x400e7e69: processCommand at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:611
 (inlined by) MQTTAgent_CommandLoop at /Users/test/esp-git/esp-matter/libraries/esp-aws-iot/libraries/coreMQTT-Agent/coreMQTT-Agent/source/core_mqtt_agent.c:1059

0x400dbea8: prvMQTTAgentTask at /Users/test/esp-git/esp-matter/examples/im/outlet/main/aws_iot/networking/mqtt/core_mqtt_agent_manager.c:521

0x40096c49: vPortTaskWrapper at /Users/test/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:154
static void prvPublishCommandCallback( MQTTAgentCommandContext_t * pxCommandContext,
                                       MQTTAgentReturnInfo_t * pxReturnInfo )
{
    /* Store the result in the application defined context so the task that
     * initiated the publish can check the operation's status. */
    pxCommandContext->xReturnStatus = pxReturnInfo->returnCode;

    if( pxCommandContext->xTaskToNotify != NULL )
    {
        /* Send the context's ulNotificationValue as the notification value so
         * the receiving task can check the value it set in the context matches
         * the value it receives in the notification. */
        xTaskNotify( pxCommandContext->xTaskToNotify,
                     pxCommandContext->ulNotificationValue,
                     eSetValueWithOverwrite ); # LINE 380
    }
}

static void prvPublishToTopic( MQTTQoS_t xQoS,
                               char * pcTopicName,
                               char * pcPayload )
{
    uint32_t ulPublishMessageId, ulNotifiedValue = 0;

    MQTTStatus_t xCommandAdded;
    BaseType_t xCommandAcknowledged = pdFALSE;

    MQTTPublishInfo_t xPublishInfo = { 0 };

    MQTTAgentCommandContext_t xCommandContext = { 0 };
    MQTTAgentCommandInfo_t xCommandParams = { 0 };

    xTaskNotifyStateClear( NULL );

    /* Create a unique number for the publish that is about to be sent.
     * This number is used in the command context and is sent back to this task
     * as a notification in the callback that's executed upon receipt of the
     * publish from coreMQTT-Agent.
     * That way this task can match an acknowledgment to the message being sent.
     */
    xSemaphoreTake( xMessageIdSemaphore, portMAX_DELAY );
    {
        ++ulMessageId;
        ulPublishMessageId = ulMessageId;
    }
    xSemaphoreGive( xMessageIdSemaphore );

    /* Configure the publish operation. The topic name string must persist for
     * duration of publish! */
    xPublishInfo.qos = xQoS;
    xPublishInfo.pTopicName = pcTopicName;
    xPublishInfo.topicNameLength = ( uint16_t ) strlen( pcTopicName );
    xPublishInfo.pPayload = pcPayload;
    xPublishInfo.payloadLength = ( uint16_t ) strlen( pcPayload );

    /* Complete an application defined context associated with this publish
     * message.
     * This gets updated in the callback function so the variable must persist
     * until the callback executes. */
    xCommandContext.ulNotificationValue = ulPublishMessageId;
    xCommandContext.xTaskToNotify = xTaskGetCurrentTaskHandle();

    xCommandParams.blockTimeMs = fleetprovbyclaimconfigMAX_COMMAND_SEND_BLOCK_TIME_MS;
    xCommandParams.cmdCompleteCallback = prvPublishCommandCallback;
    xCommandParams.pCmdCompleteCallbackContext = &xCommandContext;

    do
    {
        /* Wait for coreMQTT-Agent task to have working network connection and
         * not be performing an OTA update. */
        xEventGroupWaitBits( xNetworkEventGroup,
                             CORE_MQTT_AGENT_CONNECTED_BIT | CORE_MQTT_AGENT_OTA_NOT_IN_PROGRESS_BIT,
                             pdFALSE,
                             pdTRUE,
                             portMAX_DELAY );

        ESP_LOGI( TAG,
                  "Sending publish request to coreMQTT-Agent with message \"%s\" on topic \"%s\" with ID %ld.",
                  pcPayload,
                  pcTopicName,
                  ulPublishMessageId );

        /* To ensure ulNotification doesn't accidentally hold the expected value
         * as it is to be checked against the value sent from the callback.. */
        ulNotifiedValue = ~ulPublishMessageId;

        xCommandAcknowledged = pdFALSE;

        xCommandAdded = MQTTAgent_Publish( &xGlobalMqttAgentContext,
                                           &xPublishInfo,
                                           &xCommandParams );

        if( xCommandAdded == MQTTSuccess )
        {
            /* For QoS 1 and 2, wait for the publish acknowledgment.  For QoS0,
             * wait for the publish to be sent. */
            ESP_LOGI( TAG,
                      "Waiting for publish %ld to complete.",
                      ulPublishMessageId );

            xCommandAcknowledged = prvWaitForNotification( &ulNotifiedValue );
        }
        else
        {
            ESP_LOGE( TAG,
                      "Failed to enqueue publish command. Error code=%s",
                      MQTT_Status_strerror( xCommandAdded ) );
        }

        /* Check all ways the status was passed back just for demonstration
         * purposes. */
        if( ( xCommandAcknowledged != pdTRUE ) ||
            ( xCommandContext.xReturnStatus != MQTTSuccess ) ||
            ( ulNotifiedValue != ulPublishMessageId ) )
        {
            ESP_LOGW( TAG,
                      "Error or timed out waiting for ack for publish message %ld. Re-attempting publish.",
                      ulPublishMessageId );
        }
        else
        {
            ESP_LOGI( TAG,
                      "Publish %ld succeeded.",
                      ulPublishMessageId );
        }
        // @note newly added
        /* Add a little randomness into the delay so the tasks don't remain
         * in lockstep. */
        vTaskDelay( pdMS_TO_TICKS( 125 ) +
                        ( rand() % 0xff ) );

    } while( ( xCommandAcknowledged != pdTRUE ) ||
             ( xCommandContext.xReturnStatus != MQTTSuccess ) ||
             ( ulNotifiedValue != ulPublishMessageId ) );
}

Thank you.

chinglee-iot commented 1 year ago

Thank you for reporting this problem. We will look into this problem and discuss with you here.

chinglee-iot commented 1 year ago

@law-ko Can you help to provide the following information?

law-ko commented 1 year ago

@chinglee-iot

Your esp-idf version: v5.0.1 Your iot-reference-esp32c3 tag or commit: latest The application code which calls prvPublishToTopic:

char *jsonString = cJSON_PrintUnformatted(rootJSON);
prvPublishToTopic( xQoS, pcPubTopicBuffer, (char *) jsonString );

We have adjusted the code into our app that work with AWS IoT with desired and reported style. This is uncommon error to see and looking deeper into the code we don't know what was causing this.

chinglee-iot commented 1 year ago

The exception you shared indicates a LoadProhibited exception in the tasks.c. From the tasks.c in v5.0.1, the xTaskToNotify is passed by parameters in prvPublishCommandCallback. It looks like xTaskToNotify points to an invalid address.

    BaseType_t xTaskGenericNotify( TaskHandle_t xTaskToNotify,
                                   UBaseType_t uxIndexToNotify,
                                   uint32_t ulValue,
                                   eNotifyAction eAction,
                                   uint32_t * pulPreviousNotificationValue )
    {
            ...
            ucOriginalNotifyState = pxTCB->ucNotifyState[ uxIndexToNotify ]; /* This line has LoadProhibited exception. */

xTaskNotify(xTaskGenericNotify) is called in the prvPublishCommandCallback and is a member variable of input parameter pxCommandContext. pxCommandContext points to a local variable xCommandContext in prvPublishToTopic. Therefore, it is on the stack of the task which calls prvPublishToTopic.

static void prvPublishToTopic( MQTTQoS_t xQoS,
                               char * pcTopicName,
                               char * pcPayload )
{
...
    MQTTAgentCommandInfo_t xCommandParams = { 0 };
...
    xCommandContext.ulNotificationValue = ulPublishMessageId;
    xCommandContext.xTaskToNotify = xTaskGetCurrentTaskHandle();

    xCommandParams.blockTimeMs = subpubunsubconfigMAX_COMMAND_SEND_BLOCK_TIME_MS;
    xCommandParams.cmdCompleteCallback = prvPublishCommandCallback;
    xCommandParams.pCmdCompleteCallbackContext = &xCommandContext;

Base on these information, we can troubleshoot the following possible causes:

  1. Stack overflow Local variable xCommandContext in prvPublishToTopic may be corrupted if there is stack overflow. FreeRTOS supports stack overflow detection mechanism. You can reference this page to enable stack overflow checking.

  2. Use after free The prvPublishCommandCallback is called from another task. If the task calls prvPublishToTopic exited, the stack memory will be used by other task. The pxCommandContext may point to memory modified by other task. We would like to suggest to check the status of the task which calls prvPublishToTopic.

We also suggest you to run the sub_pub_unsub_demo only on your platform to see if this problem can be reproduced and feedback the result. This can help us to narrow down the problem.

law-ko commented 11 months ago

@chinglee-iot Would it be possible that this crash is related to the MQTT_ProcessLoop being blocked like xEventGroupWaitBit?

chinglee-iot commented 11 months ago

@law-ko MQTT_ProcessLoop is blocked by transport interface receive function in receiveSingleIteration and is only called in the prvMQTTAgentTask.

static MQTTStatus_t receiveSingleIteration( MQTTContext_t * pContext,
                                            bool manageKeepAlive )
{
    ....
    recvBytes = pContext->transportInterface.recv( pContext->transportInterface.pNetworkContext,
                                                   &( pContext->networkBuffer.pBuffer[ pContext->index ] ),
                                                   pContext->networkBuffer.size - pContext->index );

Could you help to provide the xEventGroupWaitBits call in question so we can discuss it with you here?

paulbartell commented 8 months ago

Closing this issue due to inactivity. Feel free to re-open the issue if it persists.