Hello @Crypto-Spartan ! I don't see any problem with that ^^ You can re-use/re-develop all algorithms in your project with proper credits ! Thank you
Awesome! I have absolutely no problem giving credit to you and linking this library.
Will I be required to license the ported version using the BSD license as well?
No ! As long as you keep the copyright notice, for me it's all good, you can choose any licence of your liking :)
Apologies, been a while since I've looked at this. Life got busy.
Is there any sort of documentation about how each of the algorithms themselves work? I'm not super familiar with C, so any other documentation would be helpful as I find C a bit difficult to read/decipher what's going on.
Hello ! No, there's no documentation; you're going to need to look inside the C code, I'm afraid.

If you take chameleon (the fastest and "simplest" algorithm) for example, you need to take a look at chameleon_encode.c and specifically the density_chameleon_encode_kernel method (the nomenclature is similar for the other algorithms). Here you can see how the encoding works: in two parts, via a 64-bit signature and a series of bytes. The algorithm reads input data in groups of 4 bytes (the other algorithms do the same; 4-byte processing is one of the strong points of this library speed-wise). It then applies a hash to this 4-byte group using the DENSITY_CHAMELEON_HASH_ALGORITHM macro to obtain a 16-bit key. The key is then checked against the dictionary (which contains 2^16 4-byte entries, a small footprint making it readily available from the processor's Lx cache, hence the access speed) to see if it has already been encountered previously. If that is the case and the value is identical to what was stored previously under that key, then a 1 is pushed into the signature and the hash value (16 bits or 2 bytes) is written to the output stream. If not, then the 4 input bytes are copied to the output stream and the key/value pair is inserted in the dictionary. Once the signature is full (all 64 bits have been used), its 8 bytes are copied to the output stream and it is reset to zeros, and so on.

Other algorithms are derivations of this. For example, Cheetah uses a 2-level dictionary (to enable a hash key to be associated with two values) with a predictions lookup to store causal sequences: if A (4 bytes) is always followed by B (4 bytes) in the data stream, then when A comes up in the input stream and B follows, a signature code is inserted to signal that the predicted next group conforms to the available predictions, and this avoids writing any bytes to the output stream. If A is followed by some C (4 bytes) on the other hand, the full C is written to the output stream and the predictions lookup data structure is updated with prediction(A) -> C. Finally, Lion has 4 levels of dictionary entries and 3 levels of prediction entries (lion_dictionary.h). The "*_dictionary.h" files give you the dictionary structures used by each algorithm.
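To make the signature/dictionary interplay concrete, here is a minimal Rust-flavoured sketch of the steps just described. Names, data layout and the flat Vec dictionary are illustrative only; the real density_chameleon_encode_kernel is unrolled, works on raw pointers and handles headers and partial trailing groups.

```rust
// Minimal sketch of the chameleon encoding loop described above (illustrative).
const HASH_MULTIPLIER: u32 = 0x9D6EF916;

fn hash(value: u32) -> u16 {
    (value.wrapping_mul(HASH_MULTIPLIER) >> 16) as u16
}

fn chameleon_encode(input: &[u8], dictionary: &mut Vec<u32>, out: &mut Vec<u8>) {
    // One pass per 64 groups of 4 bytes (256 input bytes): reserve a slot for
    // the 64-bit signature, emit the group payloads, then fill the slot in so
    // the decoder can read the signature before the groups it describes.
    for block in input.chunks(256) {
        let signature_pos = out.len();
        out.extend_from_slice(&[0u8; 8]); // placeholder for the signature
        let mut signature = 0u64;

        for (bit, group) in block.chunks_exact(4).enumerate() {
            let value = u32::from_le_bytes(group.try_into().unwrap());
            let key = hash(value);
            if dictionary[key as usize] == value {
                // Dictionary hit: flag the group and emit only the 2-byte key.
                signature |= 1 << bit;
                out.extend_from_slice(&key.to_le_bytes());
            } else {
                // Miss: emit the plain 4 bytes and learn the key/value pair.
                dictionary[key as usize] = value;
                out.extend_from_slice(group);
            }
        }
        out[signature_pos..signature_pos + 8].copy_from_slice(&signature.to_le_bytes());
        // A trailing group of fewer than 4 bytes is ignored in this sketch.
    }
}
```

Here the dictionary would be created as vec![0u32; 1 << 16]; the real stream additionally carries headers and an end-of-stream handling path.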
Thank you! This description is definitely helpful.
So I have officially begun working on this after a long hiatus. I have to say, it is really difficult to follow the algorithms in the code, especially without any comments at all.
I've read over your previous comment 8-10 times at this point, and I still can't figure out how the signature works in tandem with the dictionary and output stream. I understand that it hashes 4 bytes and uses that as input in the dictionary, but I don't really understand where/how the signature comes into play.
I think there's a lot of potential with this library/algorithm and it could see more mainstream use, but a lack of documentation of the algorithm is holding it back quite a bit.
Edit: I would also love more info on the density_compress_safe_size() and density_decompress_safe_size() functions and how the sizes are calculated. I'm a bit confused with them because it looks like it calculates the size for all 3 algorithms (chameleon, lion & cheetah) even though you only use 1 algorithm at a time.
Hello @Crypto-Spartan
I've read over your previous comment 8-10 times at this point, and I still can't figure out how the signature works in tandem with the dictionary and output stream. I understand that it hashes 4 bytes and uses that as input in the dictionary, but I don't really understand where/how the signature comes into play.
So, to take an example, let's use the fastest algorithm (chameleon). If we take the point of view of the decoder, it will be easy to understand.

Let's say the decoder starts reading the stream and has read the initial headers. It now reads a signature of 8 bytes, which is none other than 64 0's or 1's, where a 1 indicates that the value has to be read from the decoder's dictionary (which is empty right now), and a 0 that the value is not in the decoder's dictionary.

From the signature, if a 0 is read => the decoder reads 4 bytes from the input stream (plain data), computes a 16-bit hash from those bytes and stores a key/value pair in the dictionary, the key being obviously the hash and the value the 4 bytes. If a 1 is read => the decoder reads 2 bytes from the input stream and fetches the dictionary value (4 bytes) corresponding to this 16-bit key. The next bit in the signature is then read, and so on. After 64 bits the decoder knows it needs to read a new signature, so it now reads 8 bytes from the input stream to get this new signature, and the process starts again.
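As a companion to the encoder sketch above, here is the same decoding loop written out in Rust. It is only a sketch of the logic just described: bounds handling, headers and the end-of-stream remainder are omitted, and the helper names are not the library's.

```rust
// Sketch of the chameleon decoding loop described above (illustrative).
const HASH_MULTIPLIER: u32 = 0x9D6EF916;

fn hash(value: u32) -> u16 {
    (value.wrapping_mul(HASH_MULTIPLIER) >> 16) as u16
}

fn decode_block(input: &mut &[u8], dictionary: &mut [u32], out: &mut Vec<u8>) {
    // 1. Read the 8-byte signature describing the next (up to) 64 groups.
    let (sig_bytes, rest) = input.split_at(8);
    *input = rest;
    let signature = u64::from_le_bytes(sig_bytes.try_into().unwrap());

    for bit in 0..64 {
        if signature & (1 << bit) == 0 {
            // 0 => plain 4 bytes follow: copy them out and learn the pair.
            let (group, rest) = input.split_at(4);
            *input = rest;
            let value = u32::from_le_bytes(group.try_into().unwrap());
            dictionary[hash(value) as usize] = value;
            out.extend_from_slice(group);
        } else {
            // 1 => a 2-byte key follows: fetch the 4-byte dictionary value.
            let (key_bytes, rest) = input.split_at(2);
            *input = rest;
            let key = u16::from_le_bytes(key_bytes.try_into().unwrap());
            out.extend_from_slice(&dictionary[key as usize].to_le_bytes());
        }
    }
    // The caller repeats this until the stream's end marker / remainder.
}
```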
I think there's a lot of potential with this library/algorithm and it could see more mainstream use, but a lack of documentation of the algorithm is holding it back quite a bit.
Thanks, it's more a lack of time than anything else, but I'm open to pull requests to improve this part as well.
Edit: I would also love more info on the density_compress_safe_size() and density_decompress_safe_size() functions and how the sizes are calculated. I'm a bit confused with them because it looks like it calculates the size for all 3 algorithms (chameleon, lion & cheetah) even though you only use 1 algorithm at a time.
Basically these functions compute the maximum possible size of the encoded/decoded output for a given input size and algorithm. The aim is to let the user create a big-enough buffer to store the encoded or decoded output, depending on which operation is performed.
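For illustration only, here is the kind of worst-case bound such a function could compute for chameleon, derived purely from the format description above. This is not the library's actual density_compress_safe_size() formula (which, per the discussion, covers all three algorithms), and the header size below is a placeholder.

```rust
// Illustrative worst-case bound for chameleon only; NOT the library's
// density_compress_safe_size() formula. The header size is a placeholder.
fn chameleon_worst_case_compressed_size(input_size: usize) -> usize {
    const ASSUMED_HEADER_SIZE: usize = 8; // placeholder value

    let groups = input_size.div_ceil(4);  // number of 4-byte groups
    let signatures = groups.div_ceil(64); // one 8-byte signature per 64 groups
    // Worst case: every group misses the dictionary and is copied verbatim.
    ASSUMED_HEADER_SIZE + signatures * 8 + groups * 4
}
```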
I should add that on a personal level, I'm more and more interested in a Rust port as well (Rust being a very interesting language for multiple reasons), so I could also join you on this effort and work on a port myself at some stage; it's very possible, but not in the near future.
I understood that 8 bytes is 64 0's and 1's haha, I'm not brand new to programming in general. I struggle with C a lot because I don't really care for its syntax. Plus, with the word "Density" used everywhere in the type definitions, it's difficult to just read the code and keep track of what each and every function, variable, and type definition is for. I had to use gdb to step through the code because I had trouble following everything while just trying to read the code by itself.
Thank you for this explanation though, it was extraordinarily helpful, I understand how chameleon works now.
So far I haven't gotten very far, I have created the basic program from the quick start in Rust and I'm using it to call the C Density code.
I had an issue where I sometimes got DENSITY_STATE_ERROR_OUTPUT_BUFFER_TOO_SMALL from the density_compress_with_context() function after hitting the if statement at the beginning of the function. To me, if density_compress_safe_size() is working as intended, the if (output_size < sizeof(density_header)) here should never evaluate to true. That's why I asked about density_compress_safe_size() in my last post, because something seems off in that calculation. (Plus, I still don't understand why the calculation is done for all 3 algorithms when I feel like it should only need to be done for the specific algorithm used at that time.) Although I haven't gotten that error in a while, so I'm not entirely sure what caused it in the first place (sorry, I know that's not really helpful).
I will have my code up on my GitHub hopefully soon. The very early versions of the code are up on my GitHub here.
If you do have any spare time - something that would be immensely helpful would be a wiki page or something with either an explanation (like yours for chameleon above) or pseudocode that shows how the other algorithms work in addition to chameleon. I think that would help a lot of others understand this compression library a lot better as well.
Follow up question: how are hash collisions handled in the dictionary?
I understood that 8 bytes is 64 0's and 1's haha, I'm not brand new to programming in general. I struggle with C a lot because I don't really care for its syntax. Plus, with the word "Density" used everywhere in the type definitions, it's difficult to just read the code and keep track of what each and every function, variable, and type definition is for.
Sorry, I didn't mean to sound pedantic at all, I was just trying to explain as accurately as possible how things unfold. Yes, namespace clashes can lead to errors when smaller libraries are used inside much bigger projects which make use of potentially thousands of variables, so I opted for an extra "DENSITY_" namespace prefix in front of all identifiers to avoid any ambiguity, but readability does take a hit because of that.
Follow up question: how are hash collisions handled in the dictionary?
That's quite straightforward actually, the algorithm follows the "last seen is better" paradigm: in other words, when there is a collision, the current dictionary entry is just overwritten with the new value. But since the process is exactly the same on the decoder side, it does not cause any problems.
Of course too many collisions would be greatly detrimental to compression ratio, so I did a bit of research to find a super fast but good enough hashing function.
It is a mulshift variant used throughout the library. For chameleon, as an example, in chameleon.h:
hash (16 bits) = (value (32 bits) * DENSITY_CHAMELEON_HASH_MULTIPLIER) >> 16
where DENSITY_CHAMELEON_HASH_MULTIPLIER = 0x9D6EF916. This value is not random: after extensive testing of different arrays of candidate values against numerous different sources of data, it consistently showed a very low collision count, which is why it made it in as the hashing function's multiplier.
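Transcribed into Rust, the hash above would look something like this (the wrapping multiply mirrors C's unsigned overflow semantics):

```rust
const DENSITY_CHAMELEON_HASH_MULTIPLIER: u32 = 0x9D6EF916;

// 16-bit mulshift hash of a 4-byte group, per the formula above.
fn chameleon_hash(value: u32) -> u16 {
    (value.wrapping_mul(DENSITY_CHAMELEON_HASH_MULTIPLIER) >> 16) as u16
}

fn main() {
    // Example: hash the 4 ASCII bytes "test" read as a little-endian u32.
    let group = u32::from_le_bytes(*b"test");
    println!("{:#06x}", chameleon_hash(group));
}
```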
This value is not random: after extensive testing of different arrays of candidate values against numerous different sources of data, it consistently showed a very low collision count, which is why it made it in as the hashing function's multiplier.
Do you have any data to back this up?
No, I have since lost the test data. I had used well-known corpuses such as Silesia and others to set up the test. I was not looking for the absolute best hash multiplier for a given test set, but one that had good "no-collision" properties overall, whatever the data.
I have made the most progress thus far. I have ported chameleon_encode.c to Rust, you can view my code here: https://github.com/Crypto-Spartan/density-rs/blob/main/src/algorithms/chameleon/encode.rs
Lots of things still subject to change, and the code isn't the best it could be, but it's a good start.
Nice one ! I've looked at your code and it's already so much cleaner than C ;)
Do you intend to reproduce the library's benchmark application or to create benchmarks formatted towards a rust benchmark framework?
Do you have any initial remarks in regards to performance compared to the compiled-from-C version? Although I'm pretty sure they are quite similar. One thing I noticed is that you did not use prefetching in the Rust version; on certain architectures, prefetching the next buffer while reading the current one can speed things up quite a lot (there is a stable _mm_prefetch in Rust, but only available for x86-64; I'm not sure if prefetching can be performed on other architectures such as ARM anyway), but this is more polishing than anything else.
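For reference, a sketch of what an explicit prefetch could look like in a Rust port, assuming the current std::arch intrinsics on x86-64. This is illustrative only; whether it helps is very architecture-dependent, as noted.

```rust
// Illustrative only: hint the CPU to pull the next input chunk into cache
// while the current one is being processed. Does nothing on non-x86_64 targets.
#[inline(always)]
fn prefetch(next_chunk: &[u8]) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<{ _MM_HINT_T0 }>(next_chunk.as_ptr() as *const i8);
    }
    #[cfg(not(target_arch = "x86_64"))]
    let _ = next_chunk;
}
```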
Do you intend to reproduce the library's benchmark application or to create benchmarks formatted towards a rust benchmark framework?
I plan to use Rust benchmarking instead of recreating the library's benchmark tool. Rust has a lot of great benchmarking tools already built-in.
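For what it's worth, a typical setup with the criterion crate looks roughly like this (criterion is an external dev-dependency rather than part of the standard library; chameleon_encode here is a hypothetical stand-in for the ported function):

```rust
// benches/chameleon.rs -- run with `cargo bench` (criterion as a dev-dependency).
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Hypothetical stand-in for the ported encoder, so the sketch is self-contained.
fn chameleon_encode(input: &[u8], output: &mut [u8]) -> usize {
    output[..input.len()].copy_from_slice(input);
    input.len()
}

fn bench_chameleon(c: &mut Criterion) {
    let input = vec![0x42u8; 3440];
    let mut output = vec![0u8; input.len() * 2];
    c.bench_function("chameleon_encode 3440B", |b| {
        b.iter(|| chameleon_encode(black_box(&input), black_box(&mut output)))
    });
}

criterion_group!(benches, bench_chameleon);
criterion_main!(benches);
```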
Do you have any initial remarks in regards to performance compared to the compiled-from-C version?
I created some benchmarks today for Chameleon encoding. C is still faster, but Rust isn't using any of the optimizations that C is utilizing. (Rust is still fast though)
Benchmark results (input data was an arbitrary 3440 byte string - nothing fancy):
Right now I'm mainly focusing on the algorithms themselves - I want to ensure correctness. Optimizations can always be added later.
Additionally, I've been thinking about a way to parallelize the algorithm (to use multiple CPU cores), but the only way I've found that will work with Chameleon in its current state is to parallelize the hashing and then iterate through the hashes sequentially.
I re-ran the benchmark with the DENSITY_PREFETCH line commented out and it didn't have much of an impact on performance.
I created some benchmarks today for Chameleon encoding. C is still faster, but Rust isn't using any of the optimizations that C is utilizing. (Rust is still fast though)
What optimizations are you referring to? The C-compiled program being 3 times faster sounds like a lot though; I'm sure you compiled the Rust version in release mode, so I wonder where the difference could come from.

One thing I was doing in C is to hint the compiler to use 64-bit registers (most architectures are 64-bit nowadays) by reading 8 bytes at a time using unrolling macros. Maybe it could increase speed to read in groups of 8 bytes instead of using an iterator on 256-byte chunks (you can then extract the 4-byte groups using right shifts and bit masking)? See the sketch below.

Also, something very important is the handling of the dictionary by the CPU, and notably its placement in cache. It has to sit as close as possible (time-wise) to the computation parts of the CPU. For example, chameleon's dictionary is very small and can thus be stored entirely in L1 cache (the fastest in terms of access time from the processor) for most modern CPUs, and this makes a tremendous difference performance-wise. This happened naturally with the C-compiled code as the dictionary array was static; it would be worth checking that no unnecessary memory copying takes place in the Rust version maybe. Just a couple of ideas coming to mind...
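To make the 8-byte reading idea concrete, here is a small sketch of that access pattern. It is illustrative only, not how either implementation actually iterates.

```rust
// Read the input 8 bytes at a time and split each 64-bit word into two
// 4-byte groups with shifts, as suggested above. The trailing bytes that
// don't fill a full word are left to the caller.
fn for_each_group(input: &[u8], mut handle_group: impl FnMut(u32)) {
    for word_bytes in input.chunks_exact(8) {
        let word = u64::from_le_bytes(word_bytes.try_into().unwrap());
        handle_group(word as u32);         // low 4 bytes
        handle_group((word >> 32) as u32); // high 4 bytes
    }
    // input.chunks_exact(8).remainder() holds any leftover bytes.
}
```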
I re-ran the benchmark with the DENSITY_PREFETCH line commented out and it didn't have much of an impact on performance.
Yeah, that's negligible, but on some platforms prefetch did make a difference. Although that was 5 years ago; maybe CPU architectures have evolved in a way that makes prefetching obsolete.
Additionally, I've been thinking about a way to parallelize the algorithm (to use multiple CPU cores), but the only way I've found that will work with Chameleon in its current state is to parallelize the hashing and then iterate through the hashes sequentially.
This is definitely doable with interleaving, but you'd need to use as many dictionaries as there are threads; since a dictionary is modified continuously, you wouldn't be able to share the same dictionary between concurrent threads running encoding/decoding processes. With 8 threads for example, you create 8 dictionaries and split the file before sending each chunk to a thread with a sequence number.

To enable parallelization on the decoder side, you need a header indicating the number of dictionaries necessary, and each encoded chunk in the resulting output should also have a header indicating its length (using 8 bytes for example). That way, upon decoding, an encoded chunk is read and sent to a thread (with a sequence number) for processing, and so on. Alternatively, you could also have a big header at the beginning of the encoded data indicating the offsets of all chunks if reading parallelization is needed (the above solution needs to read the encoded chunk headers sequentially, as each size gives the start of the next chunk).

Note that the above assumes you are working with big files (multiple dictionaries "learn" slower overall as they are distinct); otherwise parallelization does not make much sense anyway.
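A rough sketch of that chunked framing, using scoped threads, one encoder (and hence one dictionary) per chunk, and an 8-byte length header per chunk. compress_chunk is a hypothetical stand-in, and the global header mentioned above is omitted.

```rust
use std::thread;

// Hypothetical stand-in: in practice each call would own its own dictionary.
fn compress_chunk(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec() // placeholder so the sketch compiles
}

// Split the input, encode the chunks in parallel, then concatenate them in
// order, each prefixed with an 8-byte little-endian length header as above.
// A real stream would also carry a global header with the chunk/dictionary count.
fn parallel_encode(input: &[u8], threads: usize) -> Vec<u8> {
    let chunk_size = input.len().div_ceil(threads.max(1)).max(1);

    let encoded: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = input
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || compress_chunk(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut out = Vec::new();
    for chunk in &encoded {
        out.extend_from_slice(&(chunk.len() as u64).to_le_bytes());
        out.extend_from_slice(chunk);
    }
    out
}
```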
What optimizations are you referring to?
I mainly meant the use of the DENSITY_UNLIKELY branch prediction hint as well as the loop unrolling. However, after further testing, the performance difference is less than I anticipated.
I'm sure you compiled the rust version in release mode
Yes, 100%. By default, benchmarks in rust use release mode anyways.
it would be worth checking out that no unnecessary memory copying takes place in the rust version maybe
AFAIK, the only places where copying bytes will occur are where .copy_from_slice() is used, which is Rust's way of doing a memcpy. This is essentially my replacement for the DENSITY_MEMCPY macro.
As for the parallelization piece, I am holding off on multithreading. However, I have been experimenting with SIMD (mainly AVX2 and SSE4), and have been able to increase the speed even further.
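One place SIMD maps naturally onto chameleon is the hashing step, so as an illustration (not taken from the actual port), four mulshift hashes at once with SSE4.1 might look like this:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Sketch only: compute four chameleon-style mulshift hashes in one shot.
// Requires SSE4.1 for _mm_mullo_epi32; each 32-bit lane holds one input group.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn hash_x4(values: __m128i) -> __m128i {
    let multiplier = _mm_set1_epi32(0x9D6EF916u32 as i32);
    let products = _mm_mullo_epi32(values, multiplier); // low 32 bits of each product
    _mm_srli_epi32::<16>(products)                      // >> 16: one 16-bit hash per lane
}
```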
On the topic of benchmarking: initially I was only encoding a short string, so the data wasn't representative of any real-world application. While I still need to test with some actual files, I replicated the short string 10,000 times to simulate more data that should be easily compressible. With more data, the Rust port was actually faster than C. (Remember, this is specifically only for encoding with Chameleon; nothing else is ported yet.)

After adding some unsafe code to remove Rust's bounds checking where it's unnecessary (since C isn't doing bounds checking anyways), as well as adding some loop unrolling, I was able to squeeze out a little more performance from the Rust port. Here are the benchmark results:
RustTest: Removed bounds checking & added loop unrolling
Rust: Original Rust port as benchmarked above
C: Original C implementation as benchmarked above
I plan to merge my "RustTest" implementation as well as add SIMD to the benchmarks soon. In basic toy tests, the SIMD was ~59% faster than regular Rust. I'm interested to see how well it performs once it's added in to the full implementation.
Very nice ! I'm curious to see that as well ! Introducing SIMD in the density algorithms was attempted a while ago (see http://cbloomrants.blogspot.com/2015/03/03-25-15-my-chameleon.html) but it did not result in any particular speed increase, so I find it quite interesting that you noticed a considerable improvement on your side.
Hi!
I'm interested in porting this project over to ~~Golang~~ Rust. I have a project I'm working on, and I think it would be pretty cool to use this library. I know this is covered by the BSD-3 license (which I believe allows for derived works), but I wanted to ask @k0dai for permission and see his thoughts on it.