Open ben-e-whitney opened 2 years ago
I am in the process of rewriting huffman_encode
and huffman_decode
. I will collect in this comment a few easily-summarizable issues I encounter in the course of that task.
quantized_data
has type long int *
but q
has type int
. quantized_data[i]
is implicitly converted to an int
, potentially resulting in a loss of information.huffman_encoding
. See below.build_codec
builds the Huffman codes bit by bit, starting with the least significant bit. The codes are stored as unsigned int
s; see the definition of huffman_codec
. In huffman_encoding
, the codes are written into the output stream using unsigned int
bitwise operations (note that the type of cur
is unsigned int *
). Imagine we're in the middle of writing the Huffman codes into the output stream. Suppose sizeof(unsigned int) == 2
, CHAR_BIT == 8
, our 'writing cursor' is in the middle of an unsigned int
(we just wrote to an unsigned int
's least significant byte and are now ready to write to its most significant byte), the last byte we wrote was 0xa5
, and we now want to write the code 255
with length 8
(byte 0xff
). If unsigned int
is little-endian, the writing operation will look like this:
[0xa5 0x00] | ([0xff 0x00] << 8]) = [0xa5 0x00] | [0x00 0xff] = [0xa5 0xff]
If instead unsigned int
is big-endian, the writing operation will look like this:
[0x00 0xa5] | ([0x00 0xff] << 8) [0x00 0xa5] | [0xff 0x00] = [0xff 0xa5]
The problem is that the output of huffman_encoding
is treated as a stream of bytes in compress_memory_huffman
(here or here), so this code isn't portable. The lossless compressor will see 0xa5
first on some architectures and 0xff
first on others.
sizeof(int)
may differ between architectures.compress_memory_huffman
, this line will copy sizeof(unsigned int)
bytes more than necessary if out_data_hit_size
is an integral multiple of CHAR_BIT * sizeof(unsigned int)
(assuming CHAR_BIT == 8
and sizeof(unsigned int) == 4
).
In #189 we are investigating how MGARD performs on some very noisy data. When this data is compressed with a tight error tolerance and then decompressed, we get an error in
huffman_decoding
. It seems that an integer overflow is involved.To reproduce, compile the MGARD CLI with debugging flags. Here's how I configured the build:
Build and install, and then run
make
in the attached directory. Here's what I get (linebreaks added):@qliu21, I'm assigning you since you wrote the function, but I can start looking into this and we can talk about it next week.