bamler-lab / constriction

Entropy coders for research and production in Python and Rust.
https://bamler-lab.github.io/constriction/
Apache License 2.0
77 stars 5 forks source link

Encoding int8/unit8 streams of symbols #41

Closed aegroto closed 2 weeks ago

aegroto commented 6 months ago

Hello, thanks for releasing such a complete and well-documented library!

I have a doubt about its application on integer streams. Why does the library force you to cast arrays to numpy.int32? In my use case, in which I have to compress neural network weights, I find a general-purpose compression algorithm such as brotli orders of magnitude more efficient than both the Range and ANS coders, and I believe that representing symbols as 32-bit integers may be part of the issue.

robamler commented 6 months ago

Thanks for reporting! With "orders of magnitude more efficient", are you referring to bit rates or run times?

Regarding bit rates: constriction requires its users to provide the entropy model, and the default stream codes that can be used through constriction's python API should have almost imperceptible overhead over the theoretical lower bound for a given combination of entropy model and data source (see measurements in readme). If you're seeing higher bit rates than expected then that's a bug in construction.

Regarding run times: while the rust library of constriction focuses on speed (and is generic over the symbol type), the functionality that's exposed via constriction's python API currently isn't very much optimized for speed. This is firstly because the python API prioritizes ease of experimentation, and most run time optimizations would either make it more difficult to use constriction in certain domains, or they would hurt bit rates in some situations. And, secondly, since constriction is meant for experimental new compression techniques that require implementing some sort of new logic on the caller side after every few symbols (e.g., evaluating probablistic ML models), typical use cases spend the overwhelming majority of the run time in python code and not in constriction anyway. If your use case does not involve calculating complicated entropy models or experimenting with new types of compression schemes (e.g., bits-back coding) then a general-purpose compression algorithm like brotli is probably a better choice than constriction.

That said, it would be easy enough to expose int8 or uint8 as a symbol type to the python API since the rust implementation is generic over the symbol type anyway. It's just a bit unclear to me what kind of functionality would be reasonable to expose in this situation. Could you please tell me more about your use case: do you have numpy tensors with dtype=uint8 that you want to compress? And can the entries of these tensors take any value or are they limited to a small alphabet? What kind of entropy models would you then use for compression? In particular, are you encoding the entries of a large tensor in bulk (using an i.i.d. entropy model for the tensor entries) or do you want to model correlations between tensor components? (if it's the latter then I don't see a way to do this efficiently in python because you'll be fundamentally limited by the speed of the python interpreter, while the run time spent in constriction would be negligible).

aegroto commented 6 months ago

Thanks for the quick and wide response! By "efficiency" i mean bitrate, I am not really interested in the execution time at the moment as it is dominated by the time required to train the network. From your words I believe I may have used a suboptimal entropy model, I will try to provide you a minimum test as soon as possible to show you in practice what I meant.

robamler commented 3 months ago

Friendly ping @aegroto, did you manage to come up with some test case where the bit rate of constriction seems surprisingly high? Even if it turns out to be due to a simple misuse of the library it might be worth sharing so I can improve the documentation.

aegroto commented 3 months ago

Friendly ping @aegroto, did you manage to come up with some test case where the bit rate of constriction seems surprisingly high? Even if it turns out to be due to a simple misuse of the library it might be worth sharing so I can improve the documentation.

Hello! Sorry for missing any updates. I may have some examples and I am willing to provide you with the code that caused such a behaviour, although it was quote some time ago and as of now I am obtaining far better results by using constriction.

As soon as I remember, instead of casting each 8-bit symbol to 32-bit integers, I was reshaping the array such that each 32-bit integer contained four 8-bit symbols, which for sure broke any entropy consideration made for the original array. If you believe this is a misuse that could be added to the documentation let me know.

robamler commented 2 weeks ago

Closing this for now as it seems like a misunderstanding. I'm open to improving the documentation to prevent similar misunderstandings from reoccurring in the future, but I can't tell where to start without a concrete code example. Please feel free to reopen when a concrete code example is available, or to directly file a PR with suggested improvements to the documentation.