Closed imammy-hacomono closed 1 year ago
@dhavalgujar Do you know of any solution, or could you please tell me someone who does? Thank you...
I don't use this OTA agent code -- but I suspect that the self test is failing. You'll need to dig down into the routines and examine what the self test is trying to verify. Typically this is something like taking control of a GPIO and toggling it and looking at the output to verify that the new software is in fact working. Everyone's hardware can be different, so make sure the standard self-test is in fact really working correctly for your hardware. If not, you'll always be failing and the system will always want to do a roll back. I write my own self-test in the main and immediately make that determination for the OTA process. This is what is typically seen in the OTA examples.
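As a rough illustration of the "write your own self-test" approach described above, here is a minimal host-side sketch in C. The names `self_test_check_t`, `run_self_test`, and the two placeholder checks are hypothetical and not part of any SDK. The idea is simply to run a list of board-specific checks and turn the outcome into an accept/reject decision.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type: one hardware check, returns true when it passes. */
typedef bool (*self_test_check_t)(void);

typedef enum {
    SELF_TEST_ACCEPT, /* report the new image as good */
    SELF_TEST_REJECT  /* report the new image as bad so the device rolls back */
} self_test_result_t;

/* Run every check in order; a missing or failing check rejects the image. */
self_test_result_t run_self_test(const self_test_check_t *checks, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (checks[i] == NULL || !checks[i]()) {
            return SELF_TEST_REJECT;
        }
    }
    return SELF_TEST_ACCEPT;
}

/* Placeholder checks. Replace them with checks that exercise your own board. */
bool check_always_passes(void) { return true; }
bool check_always_fails(void) { return false; }
```

In an application built on the AWS OTA library, the result would typically be reported from the `OtaJobEventStartTest` case of the application callback by calling `OTA_SetImageState()` with `OtaImageStateAccepted` or `OtaImageStateRejected`. How that fits into a particular project is an assumption here, so check it against the sample you are running.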
@SolidStateLEDLighting First of all, thank you for your response! It appears that inSelfTestHandler is failing because the platformInSelftest() call inside it is returning false.
static OtaErr_t inSelfTestHandler(const OtaEventData_t* pEventData) {
    OtaErr_t err = OtaErrNone;
    (void) pEventData;
    LogInfo(("Beginning self-test."));
    /* Check the platform's OTA update image state. It should also be in self test. */
    if (platformInSelftest() == true) {
        /* Callback for application specific self-test. */
        callOtaCallback(OtaJobEventStartTest, NULL);
        /* Clear self-test flag. */
        otaAgent.fileContext.isInSelfTest = false;
        /* Stop the self test timer as it is no longer required. */
        (void) otaAgent.pOtaInterface->os.timer.stop(OtaSelfTestTimer);
    } else {
        /* The job is in self test but the platform image state is not so it could be
         * an attack on the platform image state. Reject the update (this should also
         * cause the image to be erased), aborting the job and reset the device. */
        LogWarn(("Rejecting new image and rebooting:"
                 "The job is in the self-test state while the platform is not."));
        err = setImageStateWithReason(OtaImageStateRejected, (uint32_t) OtaErrImageStateMismatch);
        (void) otaAgent.pOtaInterface->pal.reset(&(otaAgent.fileContext));
    }
    if (err != OtaErrNone) {
        LogError(("Failed to start self-test: "
                  "OtaErr_t=%s",
                  OTA_Err_strerror(err)));
    }
    return err;
}
In platformInSelftest(), it seems to be checking whether the State is "OtaPalImageStatePendingCommit". Since the State is not "OtaPalImageStatePendingCommit" and it is returning false, it seems that the self-test is not even starting and is failing.
Do you need to implement changing the State to "OtaPalImageStatePendingCommit" yourself? Does anyone know how to do this?
static bool platformInSelftest(void) {
    bool selfTest = false;
    /*
     * Get the platform state from the OTA pal layer.
     */
    if (otaAgent.pOtaInterface->pal.getPlatformImageState(&(otaAgent.fileContext)) == OtaPalImageStatePendingCommit) {
        selfTest = true;
    }
    return selfTest;
}
Can someone support me?
This code looks like it is examining the PlatformImageState and comparing to a constant of OtaPalImageStatePendingCommit. If true, then a return value of true is set.
The big problem is that overly-complicated software code needs documentation and diagrams to teach developers what is going on behind the scenes. Who does that today? -- not many. The problem is further worsened by the fact that Espressif is pulling code from others like AWS. Now documentation is even harder to come by.
If you know the download is good -- for now -- just return true and test the results.
Then, write your own testing code based on your hardware for future down-loads.
I have submitted an issue to ota-for-aws-iot-embedded-sdk. Hopefully I can get some answers that will shed some light...
pal stands for Platform Abstraction Layer. I suggest you follow that function call getPlatformImageState down until you understand how that routine is going to return OtaPalImageStatePendingCommit.
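To make the chain from the flash flag to `platformInSelftest()` concrete, here is a small host-side model in C. The numeric values mirror `esp_ota_img_states_t` in ESP-IDF. The mapping from the flag to a PAL image state is an assumption about how an ESP32 PAL might behave, not code copied from the PAL, and every `model_`/`PAL_`/`IMG_` name is local to this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror esp_ota_img_states_t in ESP-IDF. */
#define IMG_NEW            0x0U
#define IMG_PENDING_VERIFY 0x1U
#define IMG_VALID          0x2U
#define IMG_UNDEFINED      0xFFFFFFFFU

/* Local stand-ins for the PAL image states; not the library's enum. */
typedef enum { PAL_PENDING_COMMIT, PAL_VALID, PAL_INVALID } pal_state_t;

/* ASSUMED mapping: only "pending verify" is treated as "pending commit". */
pal_state_t model_get_platform_image_state(uint32_t ota_state)
{
    switch (ota_state) {
        case IMG_PENDING_VERIFY:
            return PAL_PENDING_COMMIT;
        case IMG_VALID:
        case IMG_NEW:
            return PAL_VALID;
        default:
            return PAL_INVALID; /* includes IMG_UNDEFINED (0xFFFFFFFF) */
    }
}

/* Same comparison as platformInSelftest(), applied to the modelled state. */
bool model_platform_in_selftest(uint32_t ota_state)
{
    return model_get_platform_image_state(ota_state) == PAL_PENDING_COMMIT;
}
```

Under this model, a flag that was never moved to "pending verify" can never satisfy the self-test check, regardless of whether the downloaded image itself is good.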
Seeing as how all hardware can be different, apparently the standard hardware testing procedure doesn't match your hardware?
The ota-for-aws-iot-embedded-sdk is provided by AWS, not Espressif. I would not put too much hope in them, because Espressif is breaking away from AWS. This is why you are here at esp-aws-iot rather than at aws-iot-embedded-device-sdk-embedded-c.
I was first introduced to the aws-iot-embedded-device-sdk-embedded-c about 2 years back (when Espressif didn't have an answer for AWS) -- and then over time had to discover for myself through indirect observation that Espressif wasn't endorsing that anymore -- they took what they wanted from AWS to produce the aws-iot-sdk.
I have done almost everything possible in aws iot (MQTT, Provisioning, HTTP Rest, Shadow, Jobs, OTA) -- all with the aws-iot-sdk. If you examine how the sample projects work, you can simplify things quite a bit. The AWS sdks include so much for every hardware platform (other than esp) that it's a bit daunting to absorb.
This is no cake walk -- you'll need to put in some significant effort.
The problem to begin with is the following code.
typedef esp_ota_select_entry_t ota_select;

typedef struct {
    uint32_t ota_seq;
    uint8_t seq_label[20];
    uint32_t ota_state;
    uint32_t crc; /* CRC32 of ota_seq field only */
} esp_ota_select_entry_t;

static const esp_partition_t* _esp_get_otadata_partition(uint32_t* offset, ota_select* entry, bool active_part) {
    esp_err_t ret;
    const esp_partition_t* find_partition = NULL;
    spi_flash_mmap_handle_t ota_data_map;
    const void* result = NULL;
    ota_select s_ota_select[2];
    find_partition = esp_partition_find_first(ESP_PARTITION_TYPE_DATA, ESP_PARTITION_SUBTYPE_DATA_OTA, NULL);
    if (find_partition != NULL) {
        ret = esp_partition_mmap(find_partition, 0, find_partition->size, SPI_FLASH_MMAP_DATA, &result, &ota_data_map);
        if (ret != ESP_OK) {
            ESP_LOGW(TAG, "mmap failed %d", ret);
            return NULL;
        } else {
            memcpy(&s_ota_select[0], result, sizeof(ota_select));
            memcpy(&s_ota_select[1], result + SPI_FLASH_SEC_SIZE, sizeof(ota_select));
            spi_flash_munmap(ota_data_map);
        }
        uint32_t gen_0_seq = ota_select_valid(&s_ota_select[0]) ? s_ota_select[0].ota_seq : 0;
        uint32_t gen_1_seq = ota_select_valid(&s_ota_select[1]) ? s_ota_select[1].ota_seq : 0;
        ESP_LOG_BUFFER_HEXDUMP(TAG, &s_ota_select[0], sizeof(ota_select), ESP_LOG_INFO);
        ESP_LOG_BUFFER_HEXDUMP(TAG, &s_ota_select[1], sizeof(ota_select), ESP_LOG_INFO);
        ESP_LOGI(TAG, "gen_0_seq:%ld, gen_1_seq:%ld", gen_0_seq, gen_1_seq);
        if (gen_0_seq == 0 && gen_1_seq == 0) {
            ESP_LOGW(TAG, "otadata partition is invalid, factory/ota_0 is boot partition");
            memcpy(entry, &s_ota_select[0], sizeof(ota_select));
            *offset = 0;
        } else if ((gen_0_seq >= gen_1_seq && active_part) || (gen_1_seq > gen_0_seq && !active_part)) {
            memcpy(entry, &s_ota_select[0], sizeof(ota_select));
            *offset = 0;
            ESP_LOGI(TAG, "[0] aflags/seq:0x%" PRIx32 "/0x%" PRIx32 ", pflags/seq:0x%" PRIx32 "/0x%" PRIx32 "",
                     s_ota_select[0].ota_state, gen_0_seq, s_ota_select[1].ota_state, gen_1_seq);
        } else {
            memcpy(entry, &s_ota_select[1], sizeof(ota_select));
            *offset = SPI_FLASH_SEC_SIZE;
            ESP_LOGI(TAG, "[1] aflags/seq:0x%" PRIx32 "/0x%" PRIx32 ", pflags/seq:0x%" PRIx32 "/0x%" PRIx32 "",
                     s_ota_select[1].ota_state, gen_1_seq, s_ota_select[0].ota_state, gen_0_seq);
        }
    } else {
        ESP_LOGE(TAG, "no otadata partition found");
    }
    return find_partition;
}

esp_err_t aws_esp_ota_get_boot_flags(uint32_t* flags, bool active_part) {
    const esp_partition_t* part = NULL;
    uint32_t offset;
    ota_select entry;
    ESP_LOGI(TAG, "%s: %d", __func__, active_part);
    *flags = ESP_OTA_IMG_INVALID;
    part = _esp_get_otadata_partition(&offset, &entry, active_part);
    if (part == NULL) {
        return ESP_FAIL;
    }
    *flags = entry.ota_state;
    return ESP_OK;
}
The problem is that the aflags/pflags, i.e. ota_state, is still 0xffffffff. This is the value read from the internal flash of the ESP32 (partition: OTA information in otadata). There are many different types of ESPs, but can the behavior change depending on the hardware?
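For reading the aflags/pflags values in the log, a small decoder can help. The sketch below is host-side C. The numeric values mirror `esp_ota_img_states_t` in ESP-IDF, while the function name `ota_state_name` is hypothetical. An all-ones value corresponds to `ESP_OTA_IMG_UNDEFINED`.

```c
#include <stdint.h>

/* Translate a raw ota_state word from otadata into the matching
 * esp_ota_img_states_t name. Values mirror the ESP-IDF enum. */
const char *ota_state_name(uint32_t ota_state)
{
    switch (ota_state) {
        case 0x0U:        return "ESP_OTA_IMG_NEW";
        case 0x1U:        return "ESP_OTA_IMG_PENDING_VERIFY";
        case 0x2U:        return "ESP_OTA_IMG_VALID";
        case 0x3U:        return "ESP_OTA_IMG_INVALID";
        case 0x4U:        return "ESP_OTA_IMG_ABORTED";
        case 0xFFFFFFFFU: return "ESP_OTA_IMG_UNDEFINED";
        default:          return "unknown";
    }
}
```

Logging the decoded name next to the hex value makes it easier to see whether the entry ever reached the pending-verify state.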
Notice that there are almost no comments in the code -- you can see what he is doing, but you'll never know why he is doing it. There is no strategy being described in this file. To me, this is a major disappointment. I don't consider this good engineering.
Testing a new OTA download is about making the circuit function to determine if the code is running. Examples in this library show how pins were being pulled up/down and then examined to verify that the code was reading the pins correctly. In this manner -- all circuits are different -- and if your pins are held low or high -- the canned default test may be failing. You'll need to find the testing code and see what pins are being manipulated for the test.
@avsheth @dhavalgujar Hi. Somehow, PAL has not set the correct image state. Do you have any views on this or not?
I have already given you the "big picture" view. I don't think you understand it.
We found the cause. It was because CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE was not enabled. I had thought it was just a setting to enable rollback, so I had disabled it at first. But I see that enabling it also enables the function that updates the ota_state flag in the bootloader. I learned a great deal. https://docs.espressif.com/projects/esp-idf/en/v4.2/esp32/api-reference/system/ota.html#app-rollback
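The effect of that option can be pictured with a simplified host-side model in C. It follows the state transitions described in the ESP-IDF app rollback documentation linked above. It is not bootloader code, and the `model_` function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror esp_ota_img_states_t in ESP-IDF. */
#define IMG_NEW            0x0U
#define IMG_PENDING_VERIFY 0x1U
#define IMG_ABORTED        0x4U
#define IMG_UNDEFINED      0xFFFFFFFFU

/* State recorded in otadata when the freshly written slot is selected for boot. */
uint32_t model_state_after_set_boot(bool rollback_enabled)
{
    return rollback_enabled ? IMG_NEW : IMG_UNDEFINED;
}

/* State after the bootloader has processed the entry on one boot. */
uint32_t model_state_after_boot(uint32_t state, bool rollback_enabled)
{
    if (!rollback_enabled) {
        return state; /* the bootloader leaves ota_state untouched */
    }
    if (state == IMG_NEW) {
        return IMG_PENDING_VERIFY; /* first boot of the new image */
    }
    if (state == IMG_PENDING_VERIFY) {
        return IMG_ABORTED; /* rebooted without the app confirming the image */
    }
    return state;
}
```

With rollback disabled, the modelled state stays at 0xFFFFFFFF after the reboot, which matches the aflags/pflags value reported earlier in this thread. With it enabled, the first boot yields the pending-verify state that the self-test check expects.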
I have not dealt with that before. I'm not sure why someone would want to disable a roll-back, but there must be a reason for it.
I cut out all that code to keep everything simple. In my build CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE is not set. And my firmware doesn't look for it.
Without extensive documentation and understanding -- all these little things become land-mines.
Sorry that my advice wasn't more helpful to you.
K.
Every little thing is a minefield. This is so true. Let's hope this issue is useful to someone! Thanks for your cooperation!
I tried running the MQTT OTA sample using the Cellular interface Library. When I executed the job, block transfer started using MQTT communication. After receiving all the blocks, the device restarted, and I confirmed that the app version had been updated to a new one. However, for the OTA job, it published status: FAILED, and it also showed as failed on the AWS console.
I suspect that the cause of this issue is that the values of aflag and pflag read from _esp_get_otadata_partition are both 0xffffffff. However, I am just running the sample as is. Do I need any additional implementation?