aws / ota-for-aws-iot-embedded-sdk

MIT License
60 stars 74 forks source link

[BUG] OTA Block size check wrong #503

Closed BogdanTheGeek closed 10 months ago

BogdanTheGeek commented 11 months ago

If a network disconnect occurs and the ota task is suspended and resumed, when the job description comes back, the library checks the pFileContext->blocksRemaining against OTA_MAX_BLOCK_BITMAP_SIZE (128) and silently stops the ota, without updating the job. The same issue doesn't happen if the device reboots instead of resuming the ota task. The issue also never manifests if the ota job is not suspended in the middle of job, so the number of blocks is not checked unless we get a new job document while already having one? The issue is 2 fold, the check is wrong, it should be(ota.c:2404):

    else if( pFileContext->blocksRemaining > (OTA_MAX_BLOCK_BITMAP_SIZE * BITS_PER_BYTE) )

Applying the change above fixes the issue. I can create a PR with the change if you can confirm that my understanding of the issue is correct.

The other is that the ota library fails silently and the job will not resume until the device reboots.

This is the log after a network reconnection:

I (105814) hota: Received job document: {...}
W (105074) awsota: Index: 3. OTA event id: 3
W (105080) awsota: OTA size (266 blocks) greater than can be tracked. Increase `OTA_MAX_BLOCK_BITMAP_SIZE`(128)
I (105090) awsota: Unable to initialize Job Parsing: OtaJobParseErr_t=OtaJobParseErrBadModelInitParams
I (105100) awsota: otaPal_GetPlatformImageState
E (105106) awsota: Failed to execute state transition handler: Handler returned error: OtaErr_t=OtaErrJobParserError
E (105117) awsota: Current State=[WaitingForJob], Event=[ReceivedJobDocument], New state=[CreatingFile]

This is all on the main branches of ota-for-aws-iot-embedded-sdk, coreMQTT, jobs etc.

BogdanTheGeek commented 11 months ago

the buggy check was introduce in #477, its also not doing what the PR is intending it to do.

bradleysmith23 commented 11 months ago

Hello @BogdanTheGeek and thank you for bringing this to our attention. You are indeed correct, the existing check is buggy. If you would like to create a PR to address that would be great! We will review it once it is created. As for the second part of the issue, we are continuing to investigate this and will update you with our findings.

BogdanTheGeek commented 10 months ago

@bradleysmith23 i've had a chance to look though the rest of the ota code.

The check is only happening in parseJobDoc(). I don't think pFileContext->blocksRemaining is initialised when the check is being done. From what i can see the call stack is this:

I would suggest that a better place to check if the number of blocks in the ota is valid would be in validateAndStartJob.

bradleysmith23 commented 10 months ago

@BogdanTheGeek thank you for your continued effort! I agree, you are correct that pfileContext->blocksRemaining is not initialized when the check is being done. validateAndStartJob() does seem to be an appropriate place for this check, thank you for the suggestion! If you would like to amend your PR (#504) with this change, I will review it. If not, I am happy to raise a PR for this issue as well.

BogdanTheGeek commented 10 months ago

@bradleysmith23 I will move the check the into validateAndStartJob()