Hi, I found a heap out-of-bounds read in UTF8Util.hpp.
The attachment contains two POCs; both trigger the heap out-of-bounds read.
For POC1, I compiled opencc_phrase_extract with AddressSanitizer (ASAN) and ran it like this:
./opencc_phrase_extract -o tmp.txt poc1
ASAN then reports the error:
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/work/OpenCC/src/UTF8Util.hpp:49:15 in opencc::UTF8Util::NextCharLengthNoException(char const*)
Shadow bytes around the buggy address:
0x0c067fff8080: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
0x0c067fff8090: fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00
0x0c067fff80a0: 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa fa fa
0x0c067fff80b0: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
0x0c067fff80c0: fa fa 00 00 00 fa fa fa fd fd fd fd fa fa 00 00
=>0x0c067fff80d0: 00 00 fa fa fd fd fd fd fa[fa]00 00 00 07 fa fa
0x0c067fff80e0: 00 00 00 fa fa fa 00 00 00 fa fa fa 00 00 00 fa
0x0c067fff80f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8100: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8110: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c067fff8120: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
For POC2, I compiled opencc_phrase_extract without AddressSanitizer (ASAN) and ran it the same way:
./opencc_phrase_extract -o tmp.txt poc2
Then, a C++ exception is thrown:
ShouldNotBeHere! This must be a bug.
Looking into the root cause of this OOB read, I think the problem may be in UTF8Util::PrevCharLength(). This function first tries to handle a 3-byte character. When the input file contains only 1-byte characters, that first attempt presumably reads outside the buffer, which leads to the heap OOB read.
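As an illustration of one possible guard, here is a minimal sketch that steps back over the previous UTF-8 character without ever reading before the start of the buffer. It assumes the caller can pass the start of the buffer. The names LeadByteLength and PrevCharLengthChecked are hypothetical and are not part of OpenCC; this is not the project's actual code or fix.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not OpenCC's API): length of the UTF-8 character
// introduced by lead byte `ch`, or 0 if `ch` is not a lead byte for a
// 1- to 4-byte sequence (for example, a continuation byte).
static std::size_t LeadByteLength(unsigned char ch) {
  if ((ch & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((ch & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((ch & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((ch & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Hypothetical bounds-aware variant: returns the length of the character
// that ends just before `pos`, or 0 if none is found. It never reads a
// byte located before `begin`.
static std::size_t PrevCharLengthChecked(const char* begin, const char* pos) {
  assert(begin <= pos);
  const std::size_t available = static_cast<std::size_t>(pos - begin);
  // Try the shortest candidate first and stop at the buffer start.
  for (std::size_t len = 1; len <= 4 && len <= available; ++len) {
    const unsigned char lead = static_cast<unsigned char>(*(pos - len));
    if (LeadByteLength(lead) == len) {
      return len;
    }
  }
  return 0;
}
```

With a check like this, an input made up only of 1-byte characters is handled by the first loop iteration, and a position closer than 3 bytes to the start of the buffer never causes a read at pos - 3. Returning 0 instead of throwing is a design choice of this sketch only.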
Hope you can respond soon :) Thank you!
poc.zip