FreeRTOS / iot-reference-esp32

MIT License
81 stars 49 forks source link

[BUG] OTA Job notification isn't received #117

Open Fienberber opened 2 months ago

Fienberber commented 2 months ago

I'm migrating from to the esp-aws-iot [202406.01-LTS] from [202210.01-LTS]. The previous integration was working properly so I know this is not an AWS issue.

Using the OTA example, I can start OTA's using polling. Using the requestJobDocumentHandler(); by sending an event at OtaAgentEventRequestJobDocument.

But previously I was subscribed to $aws/things/+/jobs/#, which notified when OTA where available. I don't want to use polling to start an OTA. But I don't achieve subscription to the jobs topic.

I think I'm missing a critical part of the example. Because it seems like it doesn't subscribe to the $aws/things/+/jobs/# topic.

aggarg commented 2 months ago

I think I'm missing a critical part of the example. Because it seems like it doesn't subscribe to the $aws/things/+/jobs/# topic.

Are you saying that your subscribe calls fails? Or the subscribe succeeds and you do not get a notification?

Fienberber commented 2 months ago

Thank you for your response,

Actually I thought I was subscribed but since I don't get the notification I tried to add the subscription manually. At this moment I get disconnected from the broker. In the example ota_over_mqtt_demo I don't see where the subscription take place. I see the request during the initialization, but afterward I don't see any pulling either.

That's why I think I'm missing something.

AniruddhaKanhere commented 1 month ago

@Fienberber, the demo doesn't actively subscribe to the MQTT topic. See the jobs docs on AWS. Clients are sent the messages from the broker even when they are not subscribed to it.

To that end, the demos 'inject' an OTA handler which handles unsolicited publishes. Thus, any topics not handled by coreMQTT-Agent (or manager) will be passed on to the OTA handler which verifies whether the job topic matches with what is expected and processes it.

You should not need to process the OTA events yourself. The OTA task should do that on its own by calling processOTAEvents. Can you elaborate on what is happening? Is the task not starting? Are you using the unmodified version of the demo?

Thanks

Fienberber commented 1 month ago

Hi @AniruddhaKanhere thank you for you response. I use a modified version of the demo in order to comply with infrastructure. However I get the OTA to work while polling for new job. I could have missed something when modifying the demo...

I enabled every level of logs for MQTT in the menuconfig. But I see no publish when I create a remote job on the AWS control panel. The OTA handler doesn't get anything either. That's why I tried to do I myself, in order to see if the remote server actually publishes something.

The task seem to running since it works properly at startup. Or am I mistaken ?

aggarg commented 1 month ago

Can you try to add a log here to dump pxPublishInfo->pTopicName and pxPublishInfo->pPayload. This will confirm if the OTA notification is forwarded to the OTA handler.

fjuliofontes commented 1 month ago

Hi @Fienberber , Did you find the solution? I'm also suffering from the same issue. Only when calling requestJobDocumentHandler() the OTA starts.

AniruddhaKanhere commented 1 month ago

@fjuliofontes why do you need to call the requestJobDocumentHandler function manually? These lines should take care of the same.

Can you please add a breakpoint there and check whether that line is hit and the step through to figure out why isn't that message sent/processed?

AniruddhaKanhere commented 1 month ago

Also, can you try creating the job before running the code on the device? I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

fjuliofontes commented 1 month ago

@fjuliofontes why do you need to call the requestJobDocumentHandler function manually? These lines should take care of the same.

Can you please add a breakpoint there and check whether that line is hit and the step through to figure out why isn't that message sent/processed?

I'm not actually calling, it's during bootup that the function get's called. I was just pointing that it's the only way for the job to start, rebooting.

fjuliofontes commented 1 month ago

Also, can you try creating the job before running the code on the device? I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

Thank you, as I described above the OTA will start at bootup if the job is already created.

By the way, the compiler was giving a warning about a falling-through on this line and I've added a break there.

Best regards,

Fienberber commented 1 month ago

Hello @AniruddhaKanhere thank you for your response. Sorry for the delay, I still have the issue. I added a task that pulls jobs in the meantime.

Also, can you try creating the job before running the code on the device? I suggest, create the job on the console - wait for 30s, and then run the code on the device. That will verify whether there is some timing issue in the code in the repository.

@Fienberber and @fjuliofontes can you both please try the above?

As @fjuliofontes said, if the Job was created before the software start, it will work. At init the OTA is polled once. The issue happen when the software is running and a job is send from AWS.

Can you try to add a log here to dump pxPublishInfo->pTopicName and pxPublishInfo->pPayload. This will confirm if the OTA notification is forwarded to the OTA handler.

I've added the log, I've already done it, and didn't saw anything anormal.

Software startup E (7155) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729752877}"

E (7176) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729754684,"execution":{"jobId":"AFR_OTA-****","status":"IN_PROGRESS","queuedAt":1729753816,"startedAt":1729754684,"lastUpdatedAt":1729754684,"versionNumber":2,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-****","files":[{"filepath":"/device/updates","filesize":1367568,"fileid":0,"certfile":"/certificates/authcert.pem","sig-sha256-ecdsa":"MEQ...0Q=="}]}}}}"
E (14776) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/streams/AFR_OTA-****/data/json", for Payload "{"f":0,"i":0,"l":4096,"p":"6QU...RQ=="}"
E (24727) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729755015,"execution":{"jobId":"AFR_OTA-****","status":"IN_PROGRESS","statusDetails":{"self_test":"ready"},"queuedAt":1729753816,"startedAt":1729754684,"lastUpdatedAt":1729754989,"versionNumber":3,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-****","files":[{"filepath":"/device/updates","filesize":1367568,"fileid":0,"certfile":"/certificates/authcert.pem","sig-sha256-ecdsa":"MEQ...0Q=="}]}}}}"
E (25737) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/AFR_OTA-****/update/accepted", for Payload "{"timestamp":1729755016}"
E (30777) ota_over_mqtt: vOTAProcessMessage Topis "$aws/things/DEVICE-ID/jobs/start-next/accepted", for Payload "{"clientToken":"test","timestamp":1729755021}"

I can try other things if necessary.

AniruddhaKanhere commented 1 month ago

@Fienberber, thank you for that debug log.

Can you try to add the following line here:

    switch( recvEventId )
    {
+       case OtaAgentEventStart:
        case OtaAgentEventRequestJobDocument:
            ESP_LOGI( TAG, "Request Job Document event Received \n" );

            requestJobDocumentHandler();
            otaAgentState = OtaAgentStateRequestingJob;
            break;

I think that would fix the issue as the OTA task calls the requestJobDocumentHandler repeatedly.

Fienberber commented 4 weeks ago

@AniruddhaKanhere thank you for your response.

Unfortunately, it doesn't change anything. The processOTAEvents ins't even called after the Job creation... I think the issue comes from the topic subscription.

AniruddhaKanhere commented 3 weeks ago

@Fienberber, I am not sure why doesn't these lines do not call processOTAEvents.

Can you please put a breakpoint there? The prvOTADemoTask should call it repeatedly.

As far as topic subscription is considered, the Jobs topic is a special topic to which IoT core publishes even without subscription as I highlighted above here.

Thanks for your patience.

Fienberber commented 3 weeks ago

@AniruddhaKanhere tank you for your response.

I've put a break point, actually it loops only 3 times, It it waiting for an event here.

The two previous events are : OtaAgentEventRequestJobDocument and then OtaAgentEventReceivedJobDocument as expected when the firmware starts.

But then it is just waiting for an event it never gets...

I don't know if it's expected but when OtaAgentEventReceivedJobDocument is received I run into this case (OtaPalJobDocFileCreateFailed). The message received is on topic **/jobs/start-next/accepted with the content {"clientToken":"test","timestamp":1730280274}

AniruddhaKanhere commented 3 weeks ago

Ah! I see the issue now. Can you please make the following change here?

/* Temp buffer.*/
    uint8_t buff[ sizeof( OtaEventMsg_t ) ];

-   retVal = xQueueReceive( otaEventQueue, &buff, portMAX_DELAY );
+   retVal = xQueueReceive( otaEventQueue, &buff, 10000 );

    if( retVal == pdTRUE )

The above should allow the loop to break in absence of any event. That way, when the OTA job is started, the device will ping IoT core.

Thank you for your patience as we try to figure out the core issue. Thanks!

Fienberber commented 3 weeks ago

Hi @AniruddhaKanhere, thank you, It did the trick !

I want to keep using the 202406.01-LTS-release so I made some quick modification to the example If you have suggestion I'd be happy to improve what I've done.

    static OtaEvent_t lastRecvEventIdBeforeSuspend = OtaAgentEventStart;
    OtaEventMsg_t nextEvent = { 0 };

-    OtaReceiveEvent_FreeRTOS( &recvEvent );

+    {
+        EventGroupHandle_t eventGroup = xEventGroupCreate();
+        BlockingTaskParams taskParams = {
+            .eventGroup = eventGroup,
+            .recvEvent = &recvEvent
+        };
+        TaskHandle_t blockingTaskHandle;
+        // Create the blocking task
+        xTaskCreate(OtaReceiveEvent_FreeRTOS_InTask, "OtaReceiveEvent_FreeRTOS_InTask", otaconfigOTA_TASK_STACK_SIZE, &taskParams, otaconfigOTA_TASK_PRIORITY, &blockingTaskHandle);
+
+        // Wait for completion or timeout
+        EventBits_t result = xEventGroupWaitBits(eventGroup, TASK_COMPLETED_BIT, pdTRUE, pdFALSE, pdMS_TO_TICKS(TIMEOUT_MS));
+
+        if (result & TASK_COMPLETED_BIT) {
+            // Blocking function completed in time
+        } else {
+            // Timeout occurred, handle error
+            ESP_LOGI( TAG, "No events during timeout, ping IoT core." );
+            vTaskDelete(blockingTaskHandle);
+        }
+
+        // Ensure cleanup of the event group after usage
+        vEventGroupDelete(eventGroup);
+    }

    recvEventId = recvEvent.eventId;

In addition to

// Structure to pass multiple parameters to the blocking task
typedef struct {
    EventGroupHandle_t eventGroup;
    OtaEventMsg_t * recvEvent;
} BlockingTaskParams;

void OtaReceiveEvent_FreeRTOS_InTask(void *params) {
    BlockingTaskParams *taskParams = (BlockingTaskParams *)params;
    EventGroupHandle_t eventGroup = taskParams->eventGroup;
    // Call of the blocking when no events function
    OtaReceiveEvent_FreeRTOS( taskParams->recvEvent );

    // If function returns before timeout, signal completion
    xEventGroupSetBits(eventGroup, TASK_COMPLETED_BIT);
    vTaskDelete(NULL);  // Delete this task when done
}
AniruddhaKanhere commented 6 days ago

@Fienberber, your solution works indeed.

But do note that creating a deleting a task is very resource intensive. That is why I suggested that you make the change xQueueReceive() call instead. We are working on adding this change in the upstream repo.

Thank you for your patience.

jvasanthan commented 5 days ago

@AniruddhaKanhere The changes you mentioned did not work. The timeout prints the following message. _E (105926) AWS_OTA: Failed to receive event from OTA Event Queue: xQueueReceive returned error: OtaOsStatust=130 but OTA is not triggered. I had to reset ESP for OTA to take place. What am I missing here.

AniruddhaKanhere commented 5 days ago

@jvasanthan, my above suggested changes are to be used with the previous changes mentioned here.

When both of these changes are made, then only OTA works. Let me know if that helps

jvasanthan commented 5 days ago

@AniruddhaKanhere after making those 2 changes, it works. Thankyou.