Closed BogdanTheGeek closed 10 months ago
the buggy check was introduce in #477, its also not doing what the PR is intending it to do.
Hello @BogdanTheGeek and thank you for bringing this to our attention. You are indeed correct, the existing check is buggy. If you would like to create a PR to address that would be great! We will review it once it is created. As for the second part of the issue, we are continuing to investigate this and will update you with our findings.
@bradleysmith23 i've had a chance to look though the rest of the ota code.
The check is only happening in parseJobDoc()
.
I don't think pFileContext->blocksRemaining
is initialised when the check is being done.
From what i can see the call stack is this:
processJobHandler
-> pOtaFileContext == NULLgetFileContextFromJob
-> pOtaFileContext == NULLparseJobDoc
-> pOtaFileContext == pFileContext == NULLinitDocModel
-> Initialises the fileContext?blocksRemainng
actually gets set is after parseJobDoc
returns into getFileContextFromJob
: ota.c:2538
I would suggest that a better place to check if the number of blocks in the ota is valid would be in validateAndStartJob
.
@BogdanTheGeek thank you for your continued effort! I agree, you are correct that pfileContext->blocksRemaining
is not initialized when the check is being done. validateAndStartJob()
does seem to be an appropriate place for this check, thank you for the suggestion! If you would like to amend your PR (#504) with this change, I will review it. If not, I am happy to raise a PR for this issue as well.
@bradleysmith23 I will move the check the into validateAndStartJob()
If a network disconnect occurs and the ota task is suspended and resumed, when the job description comes back, the library checks the
pFileContext->blocksRemaining
againstOTA_MAX_BLOCK_BITMAP_SIZE
(128) and silently stops the ota, without updating the job. The same issue doesn't happen if the device reboots instead of resuming the ota task. The issue also never manifests if the ota job is not suspended in the middle of job, so the number of blocks is not checked unless we get a new job document while already having one? The issue is 2 fold, the check is wrong, it should be(ota.c:2404):Applying the change above fixes the issue. I can create a PR with the change if you can confirm that my understanding of the issue is correct.
The other is that the ota library fails silently and the job will not resume until the device reboots.
This is the log after a network reconnection:
This is all on the main branches of ota-for-aws-iot-embedded-sdk, coreMQTT, jobs etc.