Closed rohansingh closed 4 years ago
Thank you for catching this and reporting it in detail. Could we do pair programming here since it seems you dug deep in this topic?! Thanks in advance! So, it seems to me the resize happens but 3 bytes are not copied to the new place at realloc. The copy happens here.
Could you please print out the desc->capacity
and the desc->length
. I assume the latter is larger by 3 which means a corrupt iotc_data_desc_t desc
(capacity < lenght). The capacity wasn't updated correctly back somewhen.
But before we progress, could you prove this assumption by printing out the values? Thanks!
Sure, I'll try capturing those values and keep you posted.
Thanks, additionally capacity is some power of 2, and here the length is probably 2^n+3, but this is just an assumption, we will see.
It takes a bit of time for one of these incidents to occur, the frequency might be more like 1% rather than 5% of messages. Unfortunately it doesn't seem to be an issue with capacity not being updated. That was my first thought as well but seems like it's not the case.
I initially modified iotc_data_desc_realloc
to add some logging and breakpoints like you suggested. However, the conditional breakpoint was never hit, and it appeared that desc->capacity
was always larger than desc->length
.
But now, I've added logging and debugging logic to iotc_data_desc_append_data_resize
, as follows:
iotc_state_t iotc_data_desc_append_data_resize(iotc_data_desc_t* out,
const char* const data,
const size_t len) {
if (out == NULL || data == NULL) {
return IOTC_INVALID_PARAMETER;
}
char* orig = NULL;
char* preMemcpy = NULL;
iotc_state_t ret_state = IOTC_STATE_OK;
// preserve original contents of out->data_ptr
orig = malloc(out->length);
memcpy(orig, out->data_ptr, out->length);
// check null characters in original buffer
int hasNullCharacters =
(strnlen((const char*)out->data_ptr, out->length) != out->length);
if (hasNullCharacters != 0) {
// breakpoint 1
printf("orig %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
}
printf("pre: length %d, capacity %d\n", out->length, out->capacity);
IOTC_CHECK_STATE(ret_state = iotc_data_desc_assure_buf_len(out, len));
printf("post: length %d, capacity %d\n", out->length, out->capacity);
// check for null chars after buffer resize
hasNullCharacters =
(strnlen((const char*)out->data_ptr, out->length) != out->length);
if (hasNullCharacters != 0) {
// breakpoint 2
printf("1 %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
}
// preserve contents of out->data_ptr for debugging
preMemcpy = malloc(out->length);
memcpy(preMemcpy, out->data_ptr, out->length);
memcpy(out->data_ptr + out->length, data, len);
out->length += len;
// check for null chars after memcpy
hasNullCharacters =
(strnlen((const char*)out->data_ptr, out->length) != out->length);
if (hasNullCharacters != 0) {
// breakpoint 3
printf("2 %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
}
err_handling:
free(preMemcpy);
free(orig);
return ret_state;
}
Now, somehow breakpoints 1 and 2 are not hit, but breakpoint 3 does get hit. When that happened, the following had been printed to the log:
pre: length 390, capacity 512
post: length 390, capacity 512
Additionally:
preMemcpy
buffer has 390 bytes of content without any null characters, as expected.out->data_ptr
has 388 of the bytes from preMemcpy
, two null characters, and all 32 of the bytes from data
.So it seems like sometimes two bytes are dropped instead of three. And it appears that the corruption is happening without the buffer being resized, possibly during memcpy
. I'm not even sure how that's possible. Perhaps my original statement that this is an SDK issue and not a chip issue was wrong.
Looks like this is actually an instance of espressif/esp-idf#3006, which is caused by an underlying silicon issue. Apologies for the false report.
It seems very likely that other users of this library will run into the same bug on ESP32, since the logic for handling MQTT payloads is likely to trigger the bug under load by quickly reading/writing to the same buffer. For anyone else experiencing this, the underlying issue is being tracked at espressif/esp-idf#2892. The main workaround for now is to run in unicore mode with CONFIG_FREERTOS_UNICORE
.
While running on an ESP32, I've noticed that some of the MQTT messages received by the device are corrupted. Specifically, some messages will have three consecutive bytes replaced with null characters. This happens fairly frequently, perhaps in around 5% of all messages being received by this application. It is definitely an issue with the SDK, and not with the messages being sent or the underlying board.
Using a debugger, I've narrowed this down specifically to
iotc_data_desc_append_data_resize
, when it is called byread_data
here: https://github.com/GoogleCloudPlatform/iot-device-sdk-embedded-c/blob/a26ee5ec79ad61eaf58461a1ef870f217e08b9d4/third_party/mqtt-protocol-c/iotc_mqtt_parser.c#L139-L148In the resulting pointer, new data from
src->data_ptr
is always successfully appended onto(*dst)->data_ptr
. But it appears that sometimes, before the new bytes are appended, the last three existing bytes of(*dst)->data_ptr
are replaced with nulls.As an example, here are some of the contents of
(*dst)->data_ptr
beforeiotc_data_desc_append_data_resize
is called:In the instance that I caught while debugging, we are attempting to append this new data from
src->data_ptr
:But after calling
iotc_data_desc_append_data_resize
, the new(*dst)->data_ptr
has its original last three bytes replaced with nulls:All the new bytes from
src->data_ptr
are copied successfully, and even the length of the message is correct. But those null characters corrupt the contents of the message.