CODARcode / MGARD

MGARD: MultiGrid Adaptive Reduction of Data
Apache License 2.0

Assertion Failure in `huffman_decoding` #190

Open ben-e-whitney opened 2 years ago

ben-e-whitney commented 2 years ago

In #189 we are investigating how MGARD performs on some very noisy data. When this data is compressed with a tight error tolerance and then decompressed, we get an error in huffman_decoding. It seems that an integer overflow is involved.

To reproduce, compile the MGARD CLI with debugging flags. Here's how I configured the build:

$ cmake -S . -B "build" \
      -D CMAKE_BUILD_TYPE="Debug" \
      -D BUILD_SHARED_LIBS=OFF \
      -D BUILD_TESTING=OFF \
      -D MGARD_ENABLE_OPENMP=ON \
      -D MGARD_ENABLE_SERIAL=OFF \
      -D MGARD_ENABLE_CLI=ON
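
(I'm not certain whether MGARD's Debug configuration turns on sanitizers by itself; the "runtime error: signed integer overflow" line below comes from UBSan, so if it doesn't appear for you, the flags can be added explicitly. The exact flags here are my suggestion, not part of the project's build:)

```shell
# Hypothetical: layer UBSan on top of the Debug build so signed overflow
# is reported (and turned into a hard failure) at runtime.
$ cmake -S . -B "build" \
      -D CMAKE_BUILD_TYPE="Debug" \
      -D CMAKE_CXX_FLAGS="-fsanitize=undefined -fno-sanitize-recover=all" \
      -D BUILD_SHARED_LIBS=OFF \
      -D BUILD_TESTING=OFF \
      -D MGARD_ENABLE_OPENMP=ON \
      -D MGARD_ENABLE_SERIAL=OFF \
      -D MGARD_ENABLE_CLI=ON
```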

Build and install, and then run make in the attached directory. Here's what I get (linebreaks added):

$ make
g++    -c generate.cpp -o generate.o
g++     generate.o -o generate
./generate 256 logistic.dat
mgard compress --datatype double --shape 256x256x256 --smoothness 0 --tolerance 1.0e-10 \
      --input logistic.dat --output logistic_s_0_tol_1.0e-10.mgard
input size (bytes):  134217728
output size (bytes): 67116219
compression ratio:   1.99978
mgard decompress --input logistic_s_0_tol_1.0e-10.mgard --output logistic_s_0_tol_1.0e-10.dat
${HOME}/projects/MGARD/src/compressors.cpp:246:22: runtime error: signed integer overflow: \
      -2147429659 - 65536 cannot be represented in type 'int'
mgard: ${HOME}/projects/MGARD/src/compressors.cpp:258: void mgard::huffman_decoding(long int*, \
      std::size_t, unsigned char*, size_t, unsigned char*, size_t, unsigned char*, size_t): \
      Assertion `start_bit == out_data_hit_size' failed.
makefile:37: recipe for target 'logistic_s_0_tol_1.0e-10.dat' failed
make: *** [logistic_s_0_tol_1.0e-10.dat] Aborted

@qliu21, I'm assigning you since you wrote the function, but I can start looking into this and we can talk about it next week.

ben-e-whitney commented 2 years ago

I am in the process of rewriting huffman_encoding and huffman_decoding. I will collect in this comment a few easily summarized issues I encounter in the course of that task.

build_codec builds the Huffman codes bit by bit, starting with the least significant bit. The codes are stored as unsigned ints; see the definition of huffman_codec. In huffman_encoding, the codes are written into the output stream using unsigned int bitwise operations (note that the type of cur is unsigned int *).

Imagine we're in the middle of writing the Huffman codes into the output stream. Suppose sizeof(unsigned int) == 2, CHAR_BIT == 8, our 'writing cursor' is in the middle of an unsigned int (we just wrote to an unsigned int's least significant byte and are now ready to write to its most significant byte), the last byte we wrote was 0xa5, and we now want to write the code 255 with length 8 (byte 0xff). If unsigned int is little-endian, the writing operation will look like this:

[0xa5 0x00] | ([0xff 0x00] << 8) = [0xa5 0x00] | [0x00 0xff] = [0xa5 0xff]

If instead unsigned int is big-endian, the writing operation will look like this:

[0x00 0xa5] | ([0x00 0xff] << 8) = [0x00 0xa5] | [0xff 0x00] = [0xff 0xa5]

The problem is that the output of huffman_encoding is treated as a stream of bytes in compress_memory_huffman (here or here), so this code isn't portable. The lossless compressor will see 0xa5 first on some architectures and 0xff first on others.