BenWiederhake / splaycompress

Compression utility using Splay Trees, named ".jan" after Robert Tarjan, to enable the file extension ".tar.jan".
MIT License
3 stars 1 forks source link

Related work #7

Closed funny-falcon closed 2 months ago

funny-falcon commented 2 months ago

Googling "splaycompression" gives following links with related thing:

https://homepage.cs.uiowa.edu/~dwjones/compress/ https://ftp.arnes.si/security/crypto-tools/rpub.cl.msu.edu/crypt/other/jones-splay-compression.zip

BenWiederhake commented 2 months ago

Looks like Charles Holland Duell was right again :D

Thanks for the link, I'll read up on what exactly they do (differently)! :)

Is there anything in particular you'd like to see changed?

funny-falcon commented 2 months ago

Douglas W.Jones didn't use Splay Tree as a search tree. Instead he invented Splay-Prefix Tree algorithm that is more suited for the task. For compression he starts directly from the symbol and determines bit-sequence by going through up links of tree nodes. Also mutations in Splay-Prefix Tree are much simpler than in Splay search Tree.

Also he investigated composition of Splay-Prefix Tree with kind of Markov Chain: ie having "forest" of Splay-Prefix Trees choosing next tree depending on last symbol.

PDF is quite interesting, I read it today in one breath.

BenWiederhake commented 2 months ago

This project already uses a Splay-Prefix Tree, even though I never called it as such. The bit order hardly matters, does it? The diagrams in the paper don't preserve the binary search tree property (looks like a typo), and is very wrong about the usefulness as encryption, as already debunked. The mutations are "much simpler" because they drop the Zig-Zag step, which fundamentally changes the asymptotic behavior of Splay trees, and I wouldn't trust that all the analyses of Splay trees still apply to this "semi-splaying". "Tree with a kind of Markov Chain" does indeed sound interesting, but requires some form of domain knowledge of the data being compressed. "So if we know nothing about the properties of the data we are compressing, we might as well not compress it at all", as Wikipedia puts it.

The combination with Arithmetic Compression seems very salient, because it permits the compression scheme to use (potentially) less than one bit per symbol. However, Arithmetic Compression is famously riddled with patents and other shenanigans, so I'm not willing to deal with these problems in "just a fun side project". Also, I am not sure how that would even work.

So yes, there are some differences and some parallels to DWJ's work. Thank you for bringing it to my attention, it was an interesting read! However, for now I decide against using his ideas in this project.

funny-falcon commented 2 months ago

This project already uses a Splay-Prefix Tree, even though I never called it as such.

Really? Looking at the https://github.com/BenWiederhake/splaycompress/blob/7db1f6661566ce1c55520d5f8eb9ad3b8e8c1293/src/lib.rs#L111 I suppose it uses Splay search Tree, not Splay-Prefix Tree, since it searches current symbol in a tree. Algorithm in the paper doesn't do searching.

The mutations are "much simpler" because they drop the Zig-Zag step

No, mutations are simpler because they drop ordering. Since Splay-Prefix Tree is not used for searching, they need not to preserve ordering property ( node.left.val <= node.val && node.val <= node.right.val ). More over, internal nodes have no value at all.

"Tree with a kind of Markov Chain" does indeed sound interesting, but requires some form of domain knowledge of the data being compressed.

No, they shows that adaptivity of Splay-Prefix Tree plays together with "Markov Chaining" making adaptiveness stronger. Paper results show same "chaining" algorithm improves compression of data of different kinds without prior knowledge about data (given there are enough "Markov states").

funny-falcon commented 4 days ago

splay binary from the jones-splay-compression.zip compresses https://github.com/pocoproject/poco/blob/main/Data/SQLite/src/sqlite3.c to 5.8M with single state. (With 256 states it compresses to 3.7M, with default 32 states - to 4.2M)

jan compresses same file to 6.6M .

So referred Jones's algorithm is clearly supperior.

BenWiederhake commented 3 days ago

I'm glad that you found a superior algorithm.

May I remind you that this is just a fun side project, and was never meant to be the world's best compression utility?