byronknoll / cmix

cmix is a lossless data compression program aimed at optimizing compression ratio at the cost of high CPU/memory usage.
http://www.byronknoll.com/cmix.html
GNU General Public License v3.0

[memcpy-param-overlap](paq8.cpp:2862): `ptr_src <= ptr_dest < ptr_src + length` occurs in `memcpy` #61

Closed benehalo closed 3 months ago

benehalo commented 3 months ago

Dear All,

This bug was found on 64-bit Ubuntu 20.04, with cmix checked out from the master branch of the GitHub repository at commit 6deea578f41a6206bee9cb112fc843bac5f7980f (Sun Mar 17 08:58:25 2024 -0700).

cmix was built with ASAN using clang-14. The compile command was:

cd $PROJECT_SRC
clang++ -DFORTIFY_SOURCE -fstack-protector-all -fsanitize=address -g -march=native  src/coder/decoder.cpp src/coder/encoder.cpp src/context-manager.cpp src/contexts/bit-context.cpp src/contexts/bracket-context.cpp src/contexts/combined-context.cpp src/contexts/context-hash.cpp src/contexts/indirect-hash.cpp src/contexts/interval-hash.cpp src/contexts/interval.cpp src/contexts/sparse.cpp src/mixer/byte-mixer.cpp src/mixer/lstm-layer.cpp src/mixer/lstm.cpp src/mixer/mixer-input.cpp src/mixer/mixer.cpp src/mixer/sigmoid.cpp src/mixer/sse.cpp src/models/bracket.cpp src/models/byte-model.cpp src/models/direct-hash.cpp src/models/direct.cpp src/models/indirect.cpp src/models/fxcmv1.cpp src/models/match.cpp src/models/paq8.cpp src/models/ppmd.cpp src/predictor.cpp src/preprocess/dictionary.cpp src/preprocess/preprocessor.cpp src/runner.cpp src/states/nonstationary.cpp src/states/run-map.cpp -o cmix

To reproduce, download and unzip the attached zip archive to get the POC files, then run:

$PROJECT_SRC/cmix -n [poc] /dev/null

All the POCs in the zip archive trigger the same bug. Some of them do not trigger it on every run and may require multiple attempts.

Bug Analysis

The memcpy statement in paq8.cpp:2862 is as follows.

memcpy(&W->Letters[i+2], &W->Letters[i+1], MAX_WORD_SIZE-i-2);

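The dest memory region starts exactly one byte after the src memory region, so &W->Letters[i+1] <= &W->Letters[i+2] < &W->Letters[i+1] + MAX_WORD_SIZE-i-2 holds whenever the length is greater than 1. The two regions therefore overlap, and calling memcpy on overlapping regions is undefined behavior.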

This bug can likely be fixed by replacing memcpy with memmove, which is defined for overlapping regions.
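The suggested replacement can be sketched as follows. This is only an illustration under stated assumptions, not the actual cmix code: DemoWord and ShiftTailRight are hypothetical names, and the buffer size of 64 is assumed from the value substituted for MAX_WORD_SIZE in the GDB session.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed value of MAX_WORD_SIZE (the reporter substitutes 64 in GDB).
constexpr std::size_t kMaxWordSize = 64;

// Hypothetical stand-in for paq8::Word; only the letter buffer is modeled.
struct DemoWord {
  std::uint8_t Letters[kMaxWordSize];
};

// Shifts Letters[i+1 .. kMaxWordSize-2] one byte to the right.
// Source and destination overlap, so memmove (not memcpy) is used.
void ShiftTailRight(DemoWord* w, std::size_t i) {
  assert(i + 2 <= kMaxWordSize);  // keeps the length from wrapping around
  std::memmove(&w->Letters[i + 2], &w->Letters[i + 1], kMaxWordSize - i - 2);
}
```

With Letters holding "this" and i = 0, the call leaves "thhis" in the buffer, so the byte at index 1 can then be overwritten without losing any of the following letters.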

GDB says

Breakpoint 1, paq8::GermanStemmer::ReplaceSharpS (this=0x5632d50, W=0x91aca80) at src/models/paq8.cpp:2862
warning: Source file is more recent than executable.
2862              memcpy(&W->Letters[i+2], &W->Letters[i+1], MAX_WORD_SIZE-i-2);
(gdb) p &W->Letters[i+2]
$1 = (paq8::U8 *) 0x91aca82 "is"
(gdb) p &W->Letters[i+2] + MAX_WORD_SIZE-i-2
No symbol "MAX_WORD_SIZE" in current context.
(gdb) p &W->Letters[i+2] + 64-i-2
$2 = (paq8::U8 *) 0x91acac0 ""
(gdb) p &W->Letters[i+1] 
$3 = (paq8::U8 *) 0x91aca81 "his"
(gdb) p &W->Letters[i+1] + 64-i-2
$4 = (paq8::U8 *) 0x91acabf ""
(gdb) p &W->Letters[i+2] - &W->Letters[i+1]
$5 = 1

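The destination starts exactly one byte after the source ($5 = 1), while the copy length is 62 bytes (64-i-2 with i = 0, as implied by $1 and $2), so the two regions overlap.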
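The overlap can also be checked arithmetically from the addresses GDB printed. The sketch below is illustrative only: RangesOverlap is a hypothetical helper, and the length 62 is derived from $2 - $1 = 0x91acac0 - 0x91aca82.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when the half-open byte ranges [a, a+n) and [b, b+n)
// share at least one byte.
constexpr bool RangesOverlap(std::uintptr_t a, std::uintptr_t b, std::size_t n) {
  return n != 0 && a < b + n && b < a + n;
}

// Addresses from the GDB session: dest = $1, src = $3, length = 62 bytes.
static_assert(RangesOverlap(0x91aca82, 0x91aca81, 62),
              "dest and src of the reported memcpy overlap");
```

The same check applied to the ranges ASAN reports, [0x619000056c52,0x619000056c90) and [0x619000056c51,0x619000056c8f), gives the same result, since those are also 62-byte ranges offset by one byte.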

ASAN says

==1867601==ERROR: AddressSanitizer: memcpy-param-overlap: memory ranges [0x619000056c52,0x619000056c90) and [0x619000056c51, 0x619000056c8f) overlap
    #0 0x4c4179 in __asan_memcpy /llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3
    #1 0x6209b9 in paq8::GermanStemmer::ReplaceSharpS(paq8::Word*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:2862:11
    #2 0x620025 in paq8::GermanStemmer::Stem(paq8::Word*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:2977:5
    #3 0x598c07 in paq8::TextModel::Update(paq8::Buf&, paq8::ModelStats*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3258:28
    #4 0x60690e in paq8::TextModel::Predict(paq8::Mixer&, paq8::Buf&, paq8::ModelStats*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3160:7
    #5 0x5e8b70 in paq8::contextModel2(paq8::ModelStats*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8183:13
    #6 0x5e9f26 in paq8::Predictor::update() /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8277:11
    #7 0x5ed309 in PAQ8::Perceive(int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8383:15
    #8 0x65ea86 in Predictor::Perceive(int) /data/symccgo/bug/cmix/cmix/src/predictor.cpp:416:12
    #9 0x4ffd8e in Encoder::Encode(int) /data/symccgo/bug/cmix/cmix/src/coder/encoder.cpp:23:7
    #10 0x6a3776 in Compress(unsigned long long, std::basic_ifstream<char, std::char_traits<char> >*, std::basic_ofstream<char, std::char_traits<char> >*, unsigned long long*, Predictor*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:106:9
    #11 0x6a4c30 in RunCompression(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _IO_FILE*, unsigned long long*, unsigned long long*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:203:3
    #12 0x6a69d4 in main /data/symccgo/bug/cmix/cmix/src/runner.cpp:298:10
    #13 0x7f8fc5028082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
    #14 0x41fb4d in _start (/data/symccgo/bug/cmix/obj-asan-dbg/cmix+0x41fb4d)

0x619000056c52 is located 722 bytes inside of 960-byte region [0x619000056980,0x619000056d40)
allocated by thread T0 here:
    #0 0x4c5877 in calloc /llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x62bb1e in paq8::Array<paq8::Word, 0>::create(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:118:16
    #2 0x5f1547 in paq8::Array<paq8::Word, 0>::Array(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:76:28
    #3 0x6177aa in paq8::Cache<paq8::Word, 8u>::Cache() /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3013:22
    #4 0x6000c5 in paq8::TextModel::TextModel(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3138:3
    #5 0x5e7b70 in paq8::contextModel2(paq8::ModelStats*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8104:20
    #6 0x5e9f26 in paq8::Predictor::update() /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8277:11
    #7 0x5ed309 in PAQ8::Perceive(int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8383:15
    #8 0x65ea86 in Predictor::Perceive(int) /data/symccgo/bug/cmix/cmix/src/predictor.cpp:416:12
    #9 0x4ffd8e in Encoder::Encode(int) /data/symccgo/bug/cmix/cmix/src/coder/encoder.cpp:23:7
    #10 0x6a3776 in Compress(unsigned long long, std::basic_ifstream<char, std::char_traits<char> >*, std::basic_ofstream<char, std::char_traits<char> >*, unsigned long long*, Predictor*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:106:9
    #11 0x6a4c30 in RunCompression(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _IO_FILE*, unsigned long long*, unsigned long long*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:203:3
    #12 0x6a69d4 in main /data/symccgo/bug/cmix/cmix/src/runner.cpp:298:10
    #13 0x7f8fc5028082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

0x619000056c51 is located 721 bytes inside of 960-byte region [0x619000056980,0x619000056d40)
allocated by thread T0 here:
    #0 0x4c5877 in calloc /llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x62bb1e in paq8::Array<paq8::Word, 0>::create(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:118:16
    #2 0x5f1547 in paq8::Array<paq8::Word, 0>::Array(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:76:28
    #3 0x6177aa in paq8::Cache<paq8::Word, 8u>::Cache() /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3013:22
    #4 0x6000c5 in paq8::TextModel::TextModel(unsigned int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:3138:3
    #5 0x5e7b70 in paq8::contextModel2(paq8::ModelStats*) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8104:20
    #6 0x5e9f26 in paq8::Predictor::update() /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8277:11
    #7 0x5ed309 in PAQ8::Perceive(int) /data/symccgo/bug/cmix/cmix/src/models/paq8.cpp:8383:15
    #8 0x65ea86 in Predictor::Perceive(int) /data/symccgo/bug/cmix/cmix/src/predictor.cpp:416:12
    #9 0x4ffd8e in Encoder::Encode(int) /data/symccgo/bug/cmix/cmix/src/coder/encoder.cpp:23:7
    #10 0x6a3776 in Compress(unsigned long long, std::basic_ifstream<char, std::char_traits<char> >*, std::basic_ofstream<char, std::char_traits<char> >*, unsigned long long*, Predictor*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:106:9
    #11 0x6a4c30 in RunCompression(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, _IO_FILE*, unsigned long long*, unsigned long long*) /data/symccgo/bug/cmix/cmix/src/runner.cpp:203:3
    #12 0x6a69d4 in main /data/symccgo/bug/cmix/cmix/src/runner.cpp:298:10
    #13 0x7f8fc5028082 in __libc_start_main /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16

SUMMARY: AddressSanitizer: memcpy-param-overlap /llvm-project/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3 in __asan_memcpy
==1867601==ABORTING

POC

attached zip archive

byronknoll commented 3 months ago

Thanks, fixed (changed to memmove)