bamler-lab / constriction

Entropy coders for research and production in Python and Rust.
https://bamler-lab.github.io/constriction/
Apache License 2.0
77 stars 5 forks source link

Output Dict/Struct of Huffman Symbol Codes #28

Closed dnhmphrs closed 7 months ago

dnhmphrs commented 11 months ago

It would be very good for me (neuroinformatics research) if we could output the binary symbol codes from the Huffman tree.

I've been working in the python library but can write rust code. If this feature exists in the rust code I'm happy to write the python binding. If the feature isn't in rust code either, I'd be happy to write it and add the python binding also.

Let me know if this would be good, and maybe me a little direction about how you'd like it if so!

robamler commented 11 months ago

Thank you for reporting! Could you expand on your use case? The point of any source coding encoder is to convert a message (i.e., a sequence of symbols) to a bit string. The message can be of arbitrary length. In particular, when using a symbol code like Huffman coding, there's nothing wrong with encoding a message that consists of only a single symbol:

import constriction
import numpy as np

def single_code_word(symbol, code_book):
    encoder = constriction.symbol.QueueEncoder()
    encoder.encode_symbol(symbol, code_book)
    code_word, length = encoder.get_compressed()
    return f'{code_word[0]:0{length}b}'[::-1] # works for code words of length <=32

probabilities = np.array([0.1, 0.2, 0.3, 0.4])
code_book = constriction.symbol.huffman.EncoderHuffmanTree(probabilities)
print({symbol: single_code_word(symbol, code_book) for symbol in range(4)})

Are you looking for a more convenient way to do this? If so, I'd be curious which of the following best describes your use case:

  1. being able to call dict(tree) where tree is an EncoderHuffmanTree, and getting a dictionary that maps symbols to some representations of bit strings (like in the above example);
  2. having a method on EncoderHuffmanTree that takes a symbol as argument and outputs some representation of this symbol's code word;
  3. having a simple way to summarize an EncoderHuffmanTree in a human-readable string for debugging; or
  4. something else?
robamler commented 7 months ago

Closing this for now because it's not actionable without further information. As mentioned in my last comment, the specific requested feature already exists, and while it is very well possible that the existing solution may not be suitable for all use cases, further information is needed to clarify what exactly is missing. Without clarification of the intended use case, any attempt from my side to guess and address the underlying issue would likely run into the XY problem.

Please feel free to reopen if you can clarify the broader picture of what you're trying to achieve.