GoogleCloudPlatform / iot-device-sdk-embedded-c

Cloud IoT Device SDK for Connectivity to IoT Core.
Other
247 stars 83 forks source link

MQTT message content is corrupted by iotc_data_desc_append_data_resize #90

Closed rohansingh closed 4 years ago

rohansingh commented 4 years ago

While running on an ESP32, I've noticed that some of the MQTT messages received by the device are corrupted. Specifically, some messages will have three consecutive bytes replaced with null characters. This happens fairly frequently, perhaps in around 5% of all messages being received by this application. It is definitely an issue with the SDK, and not with the messages being sent or the underlying board.

Using a debugger, I've narrowed this down specifically to iotc_data_desc_append_data_resize, when it is called by read_data here: https://github.com/GoogleCloudPlatform/iot-device-sdk-embedded-c/blob/a26ee5ec79ad61eaf58461a1ef870f217e08b9d4/third_party/mqtt-protocol-c/iotc_mqtt_parser.c#L139-L148

In the resulting pointer, new data from src->data_ptr is always successfully appended onto (*dst)->data_ptr. But it appears that sometimes, before the new bytes are appended, the last three existing bytes of (*dst)->data_ptr are replaced with nulls.

As an example, here are some of the contents of (*dst)->data_ptr before iotc_data_desc_append_data_resize is called:

  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   
3ffebb10:                                     7B 22 61 70               {"ap
3ffebb20: 70 49 44 22 3A 22 6F 58 76 43 78 46 71 34 4F 4B   pID":"oXvCxFq4OK
...
3ffebc90: 54 51 30 67 49 59 47 30 4E 41 41 6D 6D 63 41      TQ0gIYG0NAAmmcA

In the instance that I caught while debugging, we are attempting to append this new data from src->data_ptr:

  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   
3f814f00: 41 41 44 2F 2F 36 6E 61 72 64 77 41 4B 47 68 58   AAD//6nardwAKGhX
3f814f10: 41 41 41 41 41 45 6C 46 54 6B 53 75 51 6D 43 43   AAAAAElFTkSuQmCC

But after calling iotc_data_desc_append_data_resize, the new (*dst)->data_ptr has its original last three bytes replaced with nulls:

  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F   
3f81cd90:                                     7B 22 61 70               {"ap
3f81cda0: 70 49 44 22 3A 22 6F 58 76 43 78 46 71 34 4F 4B   pID":"oXvCxFq4OK
...
3f81cf10: 54 51 30 67 49 59 47 30 4E 41 41 6D 00 00 00 41   TQ0gIYG0NAAm...A
3f81cf20: 41 44 2F 2F 36 6E 61 72 64 77 41 4B 47 68 58 41   AD//6nardwAKGhXA
3f81cf30: 41 41 41 41 45 6C 46 54 6B 53 75 51 6D 43 43      AAAAElFTkSuQmCC

All the new bytes from src->data_ptr are copied successfully, and even the length of the message is correct. But those null characters corrupt the contents of the message.

atigyi commented 4 years ago

Thank you for catching this and reporting it in detail. Could we do pair programming here since it seems you dug deep in this topic?! Thanks in advance! So, it seems to me the resize happens but 3 bytes are not copied to the new place at realloc. The copy happens here. Could you please print out the desc->capacity and the desc->length. I assume the latter is larger by 3 which means a corrupt iotc_data_desc_t desc (capacity < lenght). The capacity wasn't updated correctly back somewhen. But before we progress, could you prove this assumption by printing out the values? Thanks!

rohansingh commented 4 years ago

Sure, I'll try capturing those values and keep you posted.

atigyi commented 4 years ago

Thanks, additionally capacity is some power of 2, and here the length is probably 2^n+3, but this is just an assumption, we will see.

rohansingh commented 4 years ago

It takes a bit of time for one of these incidents to occur, the frequency might be more like 1% rather than 5% of messages. Unfortunately it doesn't seem to be an issue with capacity not being updated. That was my first thought as well but seems like it's not the case.

I initially modified iotc_data_desc_realloc to add some logging and breakpoints like you suggested. However, the conditional breakpoint was never hit, and it appeared that desc->capacity was always larger than desc->length.

But now, I've added logging and debugging logic to iotc_data_desc_append_data_resize, as follows:

iotc_state_t iotc_data_desc_append_data_resize(iotc_data_desc_t* out,
                                               const char* const data,
                                               const size_t len) {
  if (out == NULL || data == NULL) {
    return IOTC_INVALID_PARAMETER;
  }

  char* orig = NULL;
  char* preMemcpy = NULL;

  iotc_state_t ret_state = IOTC_STATE_OK;

  // preserve original contents of out->data_ptr
  orig = malloc(out->length);
  memcpy(orig, out->data_ptr, out->length);

  // check null characters in original buffer
  int hasNullCharacters =
      (strnlen((const char*)out->data_ptr, out->length) != out->length);
  if (hasNullCharacters != 0) {
    // breakpoint 1
    printf("orig %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
  }

  printf("pre: length %d, capacity %d\n", out->length, out->capacity);
  IOTC_CHECK_STATE(ret_state = iotc_data_desc_assure_buf_len(out, len));
  printf("post: length %d, capacity %d\n", out->length, out->capacity);

  // check for null chars after buffer resize
  hasNullCharacters =
      (strnlen((const char*)out->data_ptr, out->length) != out->length);
  if (hasNullCharacters != 0) {
    // breakpoint 2
    printf("1 %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
  }

  // preserve contents of out->data_ptr for debugging
  preMemcpy = malloc(out->length);
  memcpy(preMemcpy, out->data_ptr, out->length);

  memcpy(out->data_ptr + out->length, data, len);
  out->length += len;

  // check for null chars after memcpy
  hasNullCharacters =
      (strnlen((const char*)out->data_ptr, out->length) != out->length);
  if (hasNullCharacters != 0) {
    // breakpoint 3
    printf("2 %d %p %p %d %p\n", hasNullCharacters, out, data, len, orig);
  }

err_handling:
  free(preMemcpy);
  free(orig);
  return ret_state;
}

Now, somehow breakpoints 1 and 2 are not hit, but breakpoint 3 does get hit. When that happened, the following had been printed to the log:

pre: length 390, capacity 512
post: length 390, capacity 512

Additionally:

  1. The preMemcpy buffer has 390 bytes of content without any null characters, as expected.
  2. out->data_ptr has 388 of the bytes from preMemcpy, two null characters, and all 32 of the bytes from data.

So it seems like sometimes two bytes are dropped instead of three. And it appears that the corruption is happening without the buffer being resized, possibly during memcpy. I'm not even sure how that's possible. Perhaps my original statement that this is an SDK issue and not a chip issue was wrong.

rohansingh commented 4 years ago

Looks like this is actually an instance of espressif/esp-idf#3006, which is caused by an underlying silicon issue. Apologies for the false report.

It seems very likely that other users of this library will run into the same bug on ESP32, since the logic for handling MQTT payloads is likely to trigger the bug under load by quickly reading/writing to the same buffer. For anyone else experiencing this, the underlying issue is being tracked at espressif/esp-idf#2892. The main workaround for now is to run in unicore mode with CONFIG_FREERTOS_UNICORE.