CODARcode / MGARD

MGARD: MultiGrid Adaptive Reduction of Data
Apache License 2.0

Assertion Failure in `huffman_decoding` #190

Open ben-e-whitney opened 2 years ago

ben-e-whitney commented 2 years ago

In #189 we are investigating how MGARD performs on some very noisy data. When this data is compressed with a tight error tolerance and then decompressed, we get an error in huffman_decoding. It seems that an integer overflow is involved.

To reproduce, compile the MGARD CLI with debugging flags. Here's how I configured the build:

$ cmake -S . -B "build" \
      -D CMAKE_BUILD_TYPE="Debug" \
      -D BUILD_SHARED_LIBS=OFF \
      -D BUILD_TESTING=OFF \
      -D MGARD_ENABLE_OPENMP=ON \
      -D MGARD_ENABLE_SERIAL=OFF \
      -D MGARD_ENABLE_CLI=ON
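
(I'm not certain whether MGARD's Debug configuration turns on sanitizers by itself; the "runtime error: signed integer overflow" line below comes from UBSan, so if it doesn't appear for you, the flags can be added explicitly. The exact flags here are my suggestion, not part of the project's build:)

```shell
# Hypothetical: layer UBSan on top of the Debug build so signed overflow
# is reported (and turned into a hard failure) at runtime.
$ cmake -S . -B "build" \
      -D CMAKE_BUILD_TYPE="Debug" \
      -D CMAKE_CXX_FLAGS="-fsanitize=undefined -fno-sanitize-recover=all" \
      -D BUILD_SHARED_LIBS=OFF \
      -D BUILD_TESTING=OFF \
      -D MGARD_ENABLE_OPENMP=ON \
      -D MGARD_ENABLE_SERIAL=OFF \
      -D MGARD_ENABLE_CLI=ON
```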

Build and install, and then run make in the attached directory. Here's what I get (linebreaks added):

$ make
g++    -c generate.cpp -o generate.o
g++     generate.o -o generate
./generate 256 logistic.dat
mgard compress --datatype double --shape 256x256x256 --smoothness 0 --tolerance 1.0e-10 \
      --input logistic.dat --output logistic_s_0_tol_1.0e-10.mgard
input size (bytes):  134217728
output size (bytes): 67116219
compression ratio:   1.99978
mgard decompress --input logistic_s_0_tol_1.0e-10.mgard --output logistic_s_0_tol_1.0e-10.dat
${HOME}/projects/MGARD/src/compressors.cpp:246:22: runtime error: signed integer overflow: \
      -2147429659 - 65536 cannot be represented in type 'int'
mgard: ${HOME}/projects/MGARD/src/compressors.cpp:258: void mgard::huffman_decoding(long int*, \
      std::size_t, unsigned char*, size_t, unsigned char*, size_t, unsigned char*, size_t): \
      Assertion `start_bit == out_data_hit_size' failed.
makefile:37: recipe for target 'logistic_s_0_tol_1.0e-10.dat' failed
make: *** [logistic_s_0_tol_1.0e-10.dat] Aborted

@qliu21, I'm assigning you since you wrote the function, but I can start looking into this and we can talk about it next week.

ben-e-whitney commented 2 years ago

I am in the process of rewriting huffman_encoding and huffman_decoding. I will collect in this comment a few easily summarized issues I encounter in the course of that task.

build_codec builds the Huffman codes bit by bit, starting with the least significant bit. The codes are stored as unsigned ints; see the definition of huffman_codec. In huffman_encoding, the codes are written into the output stream using unsigned int bitwise operations (note that the type of cur is unsigned int *).

Imagine we're in the middle of writing the Huffman codes into the output stream. Suppose sizeof(unsigned int) == 2, CHAR_BIT == 8, our 'writing cursor' is in the middle of an unsigned int (we just wrote to an unsigned int's least significant byte and are now ready to write to its most significant byte), the last byte we wrote was 0xa5, and we now want to write the code 255 with length 8 (byte 0xff). If unsigned int is little-endian, the writing operation will look like this:

[0xa5 0x00] | ([0xff 0x00] << 8) = [0xa5 0x00] | [0x00 0xff] = [0xa5 0xff]

If instead unsigned int is big-endian, the writing operation will look like this:

[0x00 0xa5] | ([0x00 0xff] << 8) = [0x00 0xa5] | [0xff 0x00] = [0xff 0xa5]

The problem is that the output of huffman_encoding is treated as a stream of bytes in compress_memory_huffman (here or here), so this code isn't portable. The lossless compressor will see 0xa5 first on some architectures and 0xff first on others.