espressif / esp-aws-iot

AWS IoT SDK for ESP32 based chipsets
Apache License 2.0
266 stars 157 forks source link

OTA not working with long(er) thing names (CA-250) #147

Closed LozGMK closed 1 year ago

LozGMK commented 1 year ago

Hi all,

I've been trying to test (and incorporate into some existing AWS IoT code), the OTA code in this repo. That is a PlatformIO compatible version of the repo here, so only differs from this repo in ways that allow it to be developed and built in VS Code / PlatformIO. What I'm trying to say is that I don't think the changes in that repo are causing the issue I'm seeing, so it is essentially the same as the code in this repo, but slightly modified so it works in that environment.

To start with, I got the OTA code working, with a manually created thing on AWS. That thing happened to have a short name, ota_test_esp32 (14 chars).

The issue occurred when I tried to use the OTA code with existing things that had longer names, as required by the provisioning/registration process we use for those things. The longer names are of the form [AWS region name]:[User account ID, generated by AWS]:[a UUID, generated by AWS], and end up being about 80-90 characters in total (depending on region name). When tried with those things, the OTA code failed to return file data, in blocks, from a stream, on the .../data/cbor topic (described more below).

While 80 characters is over the (64-byte) default length specified in otaconfigMAX_THINGNAME_LEN (which is the most relevant setting I can find), increasing the value set in that #define to 128 didn't help (and, as explained below, the thing name can be much less than the default 64 bytes and still cause the OTA update to fail).

To test if it was the thing name length that was the issue, I started with OTA code that fully worked, hard-coded (via certificate and key saved in client.crt and client.key files) to a manually-created thing (with appropriate permissions via policies, etc.). I then made things with increasingly long "test" names (but otherwise with the same policies, settings, etc.), until the OTA failed to work properly.

The longest thing name I got to, that still worked, was eu-west-2:0123456:0123456 (25 chars). Increasing the name length by a couple of characters, to eu-west-2:01234567:01234567 caused the OTA code to work initially, but then fail, and increasing further caused it to not get any data blocks at all (described more below).

Just in case the colons were a problem, I tested some names with colons and some with underscores (e.g. eu-west-2_01234567_01234567), and the results were the same as with colons, in each case (for the same name length).

With the thing name of eu-west-2_01234567_01234567 (or eu-west-2:01234567:01234567), i.e. 27 chars, the symptoms were that the initial MQTT parts would work:

But then (with a thing name length of 27 chars), when the number of received blocks reached 24, the data blocks stop arriving. Messages are still intermittently published to topic .../get/cbor, but nothing else arrives on .../data/cbor.

When the thing name is shorter (25 chars or fewer), all of the blocks arrive, and the OTA code works as expected. When the thing name is longer (29 chars or more), then none of the data blocks arrive at all - everything else works, up to the point where topics are subscribed to and messages published, and the job and stream name are found, but no data messages are every received.

Does anyone have any thoughts on this?

Has anyone managed to make this OTA update code work with longer thing names than I've seen in examples? I've had a really good look around in the code and in examples, but have found no lead on what might be causing this.

Thank you for any help you can give.

SolidStateLEDLighting commented 1 year ago

3 thoughts... I'm not sure if these will help you.

Make sure your task stack memory is sufficient.

Next, make sure your receive buffer is sufficient. (probably is if you also fleet provisioning with it).

Why the strange job name of AFR_OTA-c9df...bc76? That strikes me as unusual.... just curious.

LozGMK commented 1 year ago

Thank you for the thoughts.

I've had a stack monitor running during all of these processes, reporting back the "high water mark" for the various tasks, using uxTaskGetStackHighWaterMark(). I don't think that showed anything running low on memory, but I'll double check that it's not occurring for the ones that failed.

I'll check the receive buffer for the OTA actions is large enough. I'm kind of running two versions of the libraries in the same project, just to get this working, so the provisioning/registration and "normal" IoT MQTT actions is done using older libraries, and the OTA update parts using newer libraries (which I'm less familiar with, as it seemed to get an order of magnitude more complex and obfuscated with the newer MQTT libraries, for no particular reason). So the buffer might be OK for the prov/reg bits, but too limited for the OTA.

The AFR_OTA-c9df...bc76 was a stream name (rather than a job name), and I just didn't want to include the whole thing, as it's quite long, so I truncated the middle section.

SolidStateLEDLighting commented 1 year ago

I just moved all my work into the component architecture (CMake driven under the components directory). I had to because my older project would not compile/link nicely with the newest release of the library (I also moved from aws-iot-device-sdk-embedded-c over to esp-aws-iot at the same time). Everything is running nicely as far as I can tell.

LozGMK commented 1 year ago

Yes, I'd like to have everything running in the newer (esp-aws-iot) libraries, but I haven't managed to wrap my head around them yet, so I'm just going for "something that does the job, with minimal changes" for now. Since all of the functions and structures seem to have changed quite a lot (between aws-iot-device-sdk-embedded-c and esp-aws-iot), I feel like it might take a bit of time to port the older code to using the newer libraries.

SolidStateLEDLighting commented 1 year ago

My original project was written for the March 2021 aws-iot-device-sdk-embedded-c.

I then went to port it to the newer August 2021 aws-iot-device-sdk-embedded-c, but I had problems.

I realized that Espressif had moved their work into the esp-aws-iot. I believe you'll notice that all the "issues" from the embedded-c sdk were also moved over to the esp-aws-iot github repository. (only 3 issues remain at aws-iot-device-sdk-embedded-c now).

I went ahead and embraced the esp-aws-iot libraries -- and also componentized my work at the same time.

The original AWS embedded C libraries were moving in the direction of OpenSSL -- and Expressif doesn't want that in their builds -- they are embracing TLS libraries (esp_tls?). So, a few things will need to be changed in that area. Just follow the examples in the TLS connection step for the patterns that apply.

I write in C++ and a few of the esp-aws-iot libraries are missing name-mangling "extern C" declarations (won't link). Other than this, the library should work.

The new build will be smaller by about 10k.

I don't have any problems with long names, but I keep my name kind of short anyhow.

K.


From: LozGMK @.> Sent: Tuesday, November 1, 2022 11:29 PM To: espressif/esp-aws-iot @.> Cc: keith ssledlighting.com @.>; Comment @.> Subject: Re: [espressif/esp-aws-iot] OTA not working with long(er) filenames (CA-250) (Issue #147)

Yes, I'd like to have everything running in the newer (esp-aws-iot) libraries, but I haven't managed to wrap my head around them yet, so I'm just going for "something that does the job, with minimal changes" for now. Since all of the functions and structures seem to have changed quite a lot (between aws-iot-device-sdk-embedded-c and esp-aws-iot), I feel like it might take a bit of time to port the older code to using the newer libraries.

— Reply to this email directly, view it on GitHubhttps://github.com/espressif/esp-aws-iot/issues/147#issuecomment-1298707686, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGGOKE3BQABQXUVAAEPZUNLWGEZPNANCNFSM6AAAAAARS7ZPNE. You are receiving this because you commented.Message ID: @.***>

LozGMK commented 1 year ago

Yes, I think it would be a good idea for me to move this project into the newer esp-aws-iot libraries, but I have run out of time to get this working, so I am looking to get something that "just works" for now, and at least once the OTA parts are working, I can update it all later.

It's just a shame that so much changed, all at once, moving from aws-iot-device-sdk-embedded-c to esp-aws-iot, seemingly abandoning the previous way of working for a very different style.

So it sounds like you have tried longer thing names and didn't have any trouble?

As I say, some of mine are about 80 to 90 chars long, and that can't be changed without having to re-do a lot of what's already been done, and by other people...

Thank you for the information, anyway. It's good to know that at least one person has managed to make it all work - gives me hope that I can get there when/if time allows!

SolidStateLEDLighting commented 1 year ago

No, I have not experimented with any long file names. As a habit I try to keep things short as possible while still being descriptive enough.

K.


From: LozGMK @.> Sent: Thursday, November 3, 2022 11:49 PM To: espressif/esp-aws-iot @.> Cc: keith ssledlighting.com @.>; Comment @.> Subject: Re: [espressif/esp-aws-iot] OTA not working with long(er) filenames (CA-250) (Issue #147)

Yes, I think it would be a good idea for me to move this project into the newer esp-aws-iot libraries, but I have run out of time to get this working, so I am looking to get something that "just works" for now, and at least once the OTA parts are working, I can update it all later.

It's just a shame that so much changed, all at once, moving from aws-iot-device-sdk-embedded-c to esp-aws-iot, seemingly abandoning the previous way of working for a very different style.

So it sounds like you have tried longer filenames and didn't have any trouble?

As I say, some of mine are about 80 to 90 chars long, and that can't be changed without having to re-do a lot of what's already been done, and by other people...

Thank you for the information, anyway. It's good to know that at least one person has managed to make it all work - gives me hope that I can get there when/if time allows!

— Reply to this email directly, view it on GitHubhttps://github.com/espressif/esp-aws-iot/issues/147#issuecomment-1302313123, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGGOKEZZFDDHPV77KVSHYWTWGPNH3ANCNFSM6AAAAAARS7ZPNE. You are receiving this because you commented.Message ID: @.***>

LozGMK commented 1 year ago

Ah, OK, so it might be the same for you if your names were longer.

By the way, I have tried tripling the size of the stack memory for the OTA task, and have doubled the size of the receive buffer, as well as reducing the number of blocks sent from a job stream to one at a time (rather than eight per publish/request) and reduced the data block size from 4096 to 1024 bytes, to greatly reduce the "load" on the RX buffers.

I still get the same symptoms:

SolidStateLEDLighting commented 1 year ago

The only other thing that comes to mind as a test -- would be to somehow reduce is in size another area of the complete transmitted string, and then raise up the name length again to 25 and see you can still receive OTA packets. If so, this points you in another direction other than looking strictly at the length of the thing name.

Also, I trust that you have looked directly into the console to see what the thing name is looking like in the system? Max length there is 128 character based on a search for that limit.

K.


From: LozGMK @.> Sent: Friday, November 4, 2022 12:22 AM To: espressif/esp-aws-iot @.> Cc: keith ssledlighting.com @.>; Comment @.> Subject: Re: [espressif/esp-aws-iot] OTA not working with long(er) filenames (CA-250) (Issue #147)

Ah, OK, so it might be the same for you if your names were longer.

By the way, I have tried tripling the size of the stack memory for the OTA task, and have doubled the size of the receive buffer, as well as reducing the number of blocks sent from a job stream to one at a time (rather than eight per publish/request) and reduced the data block size from 4096 to 1024 bytes, to greatly reduce the "load" on the RX buffers.

I still get the same symptoms:

— Reply to this email directly, view it on GitHubhttps://github.com/espressif/esp-aws-iot/issues/147#issuecomment-1302356936, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGGOKE7EFRKZZZA74NQE5JLWGPREPANCNFSM6AAAAAARS7ZPNE. You are receiving this because you commented.Message ID: @.***>

LozGMK commented 1 year ago

Thank you very much for the extra thoughts.

While using the thing name with 27 chars, I tried reducing the length of the message that was published to AWS.

To find an easy way to do that, I changed a string constant that appears to be sent in all messages published to .../get/cbor, to get file data from a job. That is the "Arbitrary client token sent in the stream "GET" message", #defined in OTA_CLIENT_TOKEN, as the string constant "rdy", in file ota_mqtt.c. I reduced that (by two characters) to the being "X" (an arbitrary, but easily-spotted 1 char string).

While using a thing name with 27 chars worked for obtaining the first 24 blocks, then failed to get any more, now it worked as well as a thing with a name of 25 chars. So reducing the published message by 2 chars allowed me to increase the thing name by 2 chars and have the OTA process work.

I initially thought this was due to a TX buffer being overflowed (when publishing to an AWS topic), and stopping the message from getting to AWS (so no data was returned), but I think it may instead be the other way around, as that "rdy" (or "X") string is also sent back in the the response message from AWS, so it would cause a problem if the response was too long for the buffer it went into.

I increased the value for the #defined OTA_NETWORK_BUFFER_SIZE, which is described as "Size of the network buffer to receive the MQTT message.". That's specified as being "otaconfigFILE_BLOCK_SIZE + extra for headers", or otaconfigFILE_BLOCK_SIZE + 128.

It would seem that the 128-byte "extra for headers" is only extra enough for a thing with a short name (of 25 chars, or fewer).

Increasing that to 256 allowed the longer thing names we are using to work with this OTA demo.

As a side note, having a long thing name also breaks the initial publish to the .../jobs/notify-next topic, as the message assembled for sending to AWS (also in the ota_mqtt.c file) looks like {"clientToken":"[value_of_reqCounter]:[thing_name]"}. That string is assembled from parts in requestJob_Mqtt().

AWS complains (and does not send back the requested data) if a client token longer than 64 characters is used, so having a long thing name in there, along with the decimal value of reqCounter (in a string, reqCounterString), and the colon, as well as the thing name, means that a thing name of 64 - 10 -1 = 53 chars could be enough to upset that part of the process.

SolidStateLEDLighting commented 1 year ago

Ok.. looks you have solved the mystery. Nice work.

A search turns up the maximum thing name at 128 characters. Still, the general idea would be to keep things a small and reasonably possible.

I use small string names that include the 6-character unique device MAC address, so that is my way of keeping the name small.

K.


From: LozGMK @.> Sent: Monday, November 7, 2022 11:33 PM To: espressif/esp-aws-iot @.> Cc: keith ssledlighting.com @.>; Comment @.> Subject: Re: [espressif/esp-aws-iot] OTA not working with long(er) filenames (CA-250) (Issue #147)

Thank you very much for the extra thoughts.

While using the thing name with 27 chars, I tried reducing the length of the message that was published to AWS.

To find an easy way tot do that, I changed a string constant that appears to be sent in all messages published to .../get/cbor, to get file data from a job. That is the "Arbitrary client token sent in the stream "GET" message", #defined in OTA_CLIENT_TOKEN, as the string constant "rdy", in file ota_mqtt.c. I reduced that (by two characters) to the being "X" (an arbitrary, but easily-spotted 1 char string).

While using a thing name with 27 chars worked for obtaining the first 24 blocks, then failed to get any more, now it worked as well as a thing with a name of 25 chars. So reducing the published message by 2 chars allowed me to increase the thing name by 2 chars and have the OTA process work.

I initially thought this was due to a TX buffer being overflowed (when publishing to an AWS topic), and stopping the message from getting to AWS (so no data was returned), but I think it may instead be the other way around, as that "rdy" (or "X") string is also sent back in the the response message from AWS, so it would cause a problem if the response was too long for the buffer it went into.

I increased the value for the #defined OTA_NETWORK_BUFFER_SIZE, which is described as "Size of the network buffer to receive the MQTT message.". That's specified as being "otaconfigFILE_BLOCK_SIZE + extra for headers", or otaconfigFILE_BLOCK_SIZE + 128.

It would seem that the 128-byte "extra for headers" is only extra enough for a thing with a short name (of 25 chars, or fewer).

Increasing that to 256 allowed the longer thing names we are using to work with this OTA demo.

As a side note, having a long thing name also breaks the initial publish to the .../jobs/notify-next topic, as the message assembled for sending to AWS (also in the ota_mqtt.c file) looks like {"clientToken":"[value_of_reqCounter]:[thing_name]"}. That string is assembled from parts in requestJob_Mqtt().

AWS complains (and does not send back the requested data) if a client token longer than 64 characters is used, so having a long file name in there, along with the decimal value of reqCounter (in a string, reqCounterString), and the colon, as well as the thing name, means that a thing name of 64 - 10 -1 = 53 chars could be enough to upset that part of the process.

— Reply to this email directly, view it on GitHubhttps://github.com/espressif/esp-aws-iot/issues/147#issuecomment-1305786963, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGGOKEZ2MRGBWRDJVMM52LLWHEOMXANCNFSM6AAAAAARS7ZPNE. You are receiving this because you commented.Message ID: @.***>

LozGMK commented 1 year ago

Yes, AWS say that the thing name can be up to 128 characters, though the OTA code gives a default, for otaconfigMAX_THINGNAME_LEN of 64 (as mentioned in this issue: [Feature Request] Misleading default config for otaconfigMAX_THINGNAME_LEN), so that needs increasing for longer names, too.

I, too, would like to use shorter names, but I'm rather stuck with what I've inherited for this project.

I think the reason for that was that we found something, along the way, that stated that ESP32s did not have guaranteed UUIDs in the MAC addresses, so other things were used to try and ensure uniqueness. I can't find the page that stated that was the case now, though...

LozGMK commented 1 year ago

Thank you again for your help, Keith.

Just in case anyone else comes across this, and it's useful, I also found that there was an issue that stopped an earlier part of the process working.

In .../ota-for-aws-iot-embedded-sdk/source/ota_mqtt.c, the thing name is added into string pPayloadParts[ 3 ], and makes up part of a message published to AWS topic $aws/things/[thing_name]/jobs/notify-next, that ends up something like {"clientToken":"[value_of_reqCounter]:[thing_name]"}. Due to the decimal value of reqCounter, which can be up to 10 characters, and the colon, the thing name can only be 64 - 10 -1 = 53 chars before there is (potentially) a problem with this request, as AWS doesn't allow a client token longer than 64 characters.

For that, I just put some other, short, fixed string in the client token instead of the thing name, as the value of reqCounter should be sufficiently unique to tie the request and response together, if needed.

Also, the value of reqCounter is turned into a decimal string using function stringBuilderUInt32Decimal(), which returns an empty string when it is sent a value of 0, rather than "0", as might be expected (giving a message of {"clientToken":":[thing_name]"}, with no decimal value before the (second) colon). That function is used elsewhere, too, and may cause similar issues when making strings such as versionMajorString, versionMinorString and versionBuildString.

The reason for that function being there, rather than using snprintf() or similar, seems to be that use of snprintf() has been eliminated (at least from that file), according to issue/pull Eliminate use of snprintf().

SolidStateLEDLighting commented 1 year ago

You're welcome. Happy to help!


From: LozGMK @.> Sent: Tuesday, November 8, 2022 5:44 PM To: espressif/esp-aws-iot @.> Cc: keith ssledlighting.com @.>; Comment @.> Subject: Re: [espressif/esp-aws-iot] OTA not working with long(er) thing names (CA-250) (Issue #147)

Thank you again for your help, Keith.

Just in case anyone else comes across this, and it's useful, I also found that there was an issue that stopped an earlier part of the process working.

In .../ota-for-aws-iot-embedded-sdk/source/ota_mqtt.c, the thing name is added into string pPayloadParts[ 3 ], and makes up part of a message published to AWS topic $aws/things/[thing_name]/jobs/notify-next, that ends up something like {"clientToken":"[value_of_reqCounter]:[thing_name]"}. Due to the decimal value of reqCounter, which can be up to 10 characters, and the colon, the thing name can only be 64 - 10 -1 = 53 chars before there is (potentially) a problem with this request, as AWS doesn't allow a client token longer than 64 characters.

For that, I just put some other, short, fixed string in the client token instead of the thing name, as the value of reqCounter should be sufficiently unique to tie the request and response together, if needed.

Also, the value of reqCounter is turned into a decimal string using function stringBuilderUInt32Decimal(), which returns an empty string when it is sent a value of 0, rather than "0", as might be expected (giving a message of {"clientToken":":[thing_name]"}, with no decimal value before the (second) colon). That function is used elsewhere, too, and may cause similar issues when making strings such as versionMajorString, versionMinorString and versionBuildString.

The reason for that function being there, rather than using snprintf() or similar, seems to be that use of snprintf() has been eliminated (at least from that file), according to issue/pull Eliminate use of snprintf()https://github.com/aws/ota-for-aws-iot-embedded-sdk/pull/107.

— Reply to this email directly, view it on GitHubhttps://github.com/espressif/esp-aws-iot/issues/147#issuecomment-1306920632, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGGOKE22FGWFWGUPOXR2GB3WHIOJXANCNFSM6AAAAAARS7ZPNE. You are receiving this because you commented.Message ID: @.***>