BYVoid / OpenCC

Conversion between Traditional and Simplified Chinese
https://opencc.byvoid.com/
Apache License 2.0

Heap Out-Of-Bound Read in UTF8Util.hpp #794

Open morningbread opened 1 year ago

morningbread commented 1 year ago

Hi, I found a heap out-of-bounds read in UTF8Util.hpp.

Two POCs are attached, and both can trigger the heap out-of-bounds read. For POC1, I compiled opencc_phrase_extract with AddressSanitizer (ASAN) and ran it like this:

  ./opencc_phrase_extract -o tmp.txt poc1

ASAN then reports the error:

  SUMMARY: AddressSanitizer: heap-buffer-overflow /home/work/OpenCC/src/UTF8Util.hpp:49:15 in opencc::UTF8Util::NextCharLengthNoException(char const*)
  Shadow bytes around the buggy address:
    0x0c067fff8080: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
    0x0c067fff8090: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00
    0x0c067fff80a0: 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa
    0x0c067fff80b0: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
    0x0c067fff80c0: fa fa 00 00 00 fa fa fa fd fd fd fd fa fa 00 00
  =>0x0c067fff80d0: 00 00 fa fa fd fd fd fd fa[fa]00 00 00 07 fa fa
    0x0c067fff80e0: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
    0x0c067fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c067fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c067fff8110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
    0x0c067fff8120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa

For POC2, I compiled opencc_phrase_extract without AddressSanitizer (ASAN) and ran it the same way:

  ./opencc_phrase_extract -o tmp.txt poc2

This time a C++ exception is thrown with the message:

  ShouldNotBeHere! This must be a bug.

Looking into the root cause of this OOB read, I think the problem is in UTF8Util::PrevCharLength(), quoted below. The function first checks for a 3-byte character by reading from str - 3, and it does not check whether that address is still inside the buffer. When the input file consists entirely of 1-byte characters, that read can land before the start of the buffer, which leads to the heap OOB read.

Hope you can respond soon :) Thank you!

  static size_t PrevCharLength(const char* str) {
    {
      const size_t length = NextCharLengthNoException(str - 3);
      if (length == 3) {
        return length;
      }
    }
    {
      const size_t length = NextCharLengthNoException(str - 1);
      if (length == 1) {
        return length;
      }
    }
    {
      const size_t length = NextCharLengthNoException(str - 2);
      if (length == 2) {
        return length;
      }
    }
    for (size_t i = 4; i <= 6; i++) {
      const size_t length = NextCharLengthNoException(str - i);
      if (length == i) {
        return length;
      }
    }
    throw InvalidUTF8(str);
  }
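
One possible direction for a fix, as a rough sketch only: pass the start of the buffer into the function and never look further back than that. The names PrevCharLengthBounded and LeadByteLength are hypothetical and not part of OpenCC. The sketch also makes two simplifying assumptions that differ from the code above: it returns 0 instead of throwing InvalidUTF8, and it only accepts sequences of up to 4 bytes rather than 6.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of OpenCC: length of the UTF-8 sequence that
// starts with lead byte `c`, or 0 if `c` cannot start a sequence.
inline std::size_t LeadByteLength(unsigned char c) {
  if ((c & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                          // continuation byte or invalid byte
}

// Hypothetical bounded variant of PrevCharLength. `begin` is the first byte
// of the buffer and `str` points just past the character of interest.
// Returns the length of the character that ends at `str`, or 0 if there is
// no valid character there (the original throws InvalidUTF8 instead).
inline std::size_t PrevCharLengthBounded(const char* begin, const char* str) {
  assert(begin <= str);
  const std::size_t available = static_cast<std::size_t>(str - begin);
  // Step back one byte at a time, but never past `begin`.
  for (std::size_t i = 1; i <= 4 && i <= available; i++) {
    const unsigned char c = static_cast<unsigned char>(*(str - i));
    const std::size_t length = LeadByteLength(c);
    if (length == i) {
      return length;  // lead byte whose sequence ends exactly at `str`
    }
    if (length != 0) {
      return 0;  // lead byte found, but its length does not match
    }
    // length == 0: continuation byte, keep stepping back
  }
  return 0;
}
```

With this shape, a caller that holds a buffer `buf` of `n` bytes would call PrevCharLengthBounded(buf, buf + n). For an input made only of 1-byte characters the loop returns after reading the single byte at str - 1, so nothing before the buffer is touched.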

poc.zip