facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

[Help Wanted, Questions] Improving Dictionary Training Process #4127

Open neiljohari opened 3 months ago

neiljohari commented 3 months ago

Hi team,

We are integrating zstd + shared compression dictionaries at Roblox for serving feature flag payloads! We think this is a good use case because the payload looks similar over time (people add new flags, flip them, or delete them, but week-over-week the data is very similar), and we control the client, so we can ship it with the dictionary.

I've been playing around with various training parameters and set up a few harnesses to try out different parameter combinations (and found previous insight here), and I've found a few approaches that seem to work well. However, it still feels a bit like blind guessing, and we aren't sure whether this is the best we can do, so we were wondering if the maintainers/community have insight into how we can improve our process.

Our payloads are currently ~435 KB of JSON and differ by client type, though the majority of flags are the same across clients. Examples:

We currently artificially expand our payload into a bunch of training files:

We validate the compression ratio the dictionary achieves on historical payloads to see its effectiveness over time.
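Concretely, a single validation run looks roughly like the sketch below; the dictionary path, the historical payload path, and the compression level are placeholders rather than our exact setup.

/* Rough sketch of one validation run: compress one historical payload with
 * the trained dictionary and print the ratio. Paths and level are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void* read_file(const char* path, size_t* size)
{
    FILE* f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    void* buf = malloc(*size);
    if (fread(buf, 1, *size, f) != *size) { perror(path); exit(1); }
    fclose(f);
    return buf;
}

int main(void)
{
    size_t dictSize, srcSize;
    void* dict = read_file("flags.dict", &dictSize);                 /* trained dictionary (placeholder path) */
    void* src  = read_file("2024-08-13/AndroidApp.json", &srcSize);  /* historical payload (placeholder path) */

    size_t dstCap = ZSTD_compressBound(srcSize);
    void* dst = malloc(dstCap);

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t csize = ZSTD_compress_usingDict(cctx, dst, dstCap, src, srcSize,
                                           dict, dictSize, 19 /* level, arbitrary */);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(csize));
        return 1;
    }
    printf("%zu -> %zu bytes, ratio %.2f\n", srcSize, csize,
           (double)srcSize / (double)csize);

    ZSTD_freeCCtx(cctx);
    free(dst); free(src); free(dict);
    return 0;
}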

Example of our training dir structure:

├── 2024-08-13
│   ├── AndroidApp.json_split_2048
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_1
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_2
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_4096
│   │   ├── AndroidApp.json_chunk_4096_000000.json
│   │   ├── AndroidApp.json_chunk_4096_000001.json
│   │   ├── AndroidApp.json_chunk_4096_000002.json
│   │   ├── AndroidApp.json_chunk_4096_000003.json
│   │   ├── AndroidApp.json_chunk_4096_000004.json
│   ├── AndroidApp.json_split_4096_copy_1
│       ├── AndroidApp.json_chunk_4096_000000.json
│       ├── AndroidApp.json_chunk_4096_000001.json
│       ├── AndroidApp.json_chunk_4096_000002.json
│       ├── AndroidApp.json_chunk_4096_000003.json
│       ├── AndroidApp.json_chunk_4096_000004.json
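For completeness, the training step over those chunk files boils down to the stable ZDICT API, whether it is driven through the CLI or the library directly. Below is a rough C sketch of that call; the 110 KB target dictionary size and the flags.dict output path are arbitrary placeholders, not our actual configuration.

/* Rough sketch: build a dictionary from the chunk files passed as arguments. */
#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

int main(int argc, char** argv)
{
    if (argc < 2) { fprintf(stderr, "usage: train <chunk files...>\n"); return 1; }

    size_t total = 0;
    size_t* sampleSizes = malloc((size_t)(argc - 1) * sizeof(size_t));

    /* First pass: record each sample's size and the total. */
    for (int i = 1; i < argc; i++) {
        FILE* f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); return 1; }
        fseek(f, 0, SEEK_END);
        sampleSizes[i - 1] = (size_t)ftell(f);
        total += sampleSizes[i - 1];
        fclose(f);
    }

    /* Second pass: concatenate all samples back-to-back into one buffer. */
    char* samples = malloc(total);
    size_t off = 0;
    for (int i = 1; i < argc; i++) {
        FILE* f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); return 1; }
        if (fread(samples + off, 1, sampleSizes[i - 1], f) != sampleSizes[i - 1]) {
            perror(argv[i]); return 1;
        }
        off += sampleSizes[i - 1];
        fclose(f);
    }

    size_t dictCap = 112640;   /* 110 KB target size, an arbitrary choice here */
    void* dict = malloc(dictCap);
    size_t dictSize = ZDICT_trainFromBuffer(dict, dictCap,
                                            samples, sampleSizes,
                                            (unsigned)(argc - 1));
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }

    FILE* out = fopen("flags.dict", "wb");   /* placeholder output path */
    fwrite(dict, 1, dictSize, out);
    fclose(out);
    printf("dictionary size: %zu bytes\n", dictSize);
    return 0;
}

ZDICT_trainFromBuffer() expects all samples concatenated in one buffer alongside a parallel array of their sizes, which is what the two passes above build.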

Some things we've noticed that we don't quite understand:

Thanks in advance for your time! We're really excited to roll this out soon and would love your insights on how we can do even better.

neiljohari commented 2 months ago

Hey @Cyan4973!

Apologies for the direct ping; just wondering if you had any insight on our questions and whether our process is sensible, or if you see room for improvement?

Thanks!

Cyan4973 commented 2 months ago

Hi @neiljohari ,

It's a complex use case; unfortunately, it will require more than one sentence to answer.

Our payloads currently are ~435kb JSON

That's a starting point. In general, 435 KB should be large enough to compress well without a dictionary. A dictionary can still help a little, but generally it's not worth the cost.

That being said, this statement only holds in general. Your use case might well be more specific, in which case the statement above wouldn't apply, and this becomes somewhat uncharted territory (or at least something we don't encounter regularly on our side).

It would be relatively simple to manufacture a counter-example: imagine a file which is hard to compress (say it consists of encrypted segments), so that on its own it may be 500 KB yet compress very poorly. Then you might have multiple files of the same kind, meaning they consist of roughly the same encrypted segments, repeated across files but never within a file. In that case, assuming the quantity of segments is tractable, a dictionary will work wonders, and there will be a big compression difference between using a dictionary or not, even though the original file is "large" and therefore presumed not to benefit from one.

While the scenario described above is a bit of a trope, it wouldn't be too hard to end up in a somewhat related one. For example, we could imagine a very large template, with a lot of static content essentially repeated across documents (but not necessarily within a document), and just a few short answers differing (and even then, often similar).

And now, I wonder if that's what's happening for your use case.

One of the problems of the dictionary builder is that this is not the use case it's designed for. It rather expects small segments present more or less randomly across many small inputs, which is very different from large sections being identical across large inputs. So the dictionary builder will make the mistake of cutting these large sections into small segments, introducing a lot of random noise (and losing a fair amount of efficiency) in the process.

Hence, the "ideal" dictionary builder for this use case doesn't exist, or at least is not (yet) provided by this repository.

One suggestion could be to use one complete JSON file as a point of reference, which would become the dictionary. Assuming most other files are highly related to this one, the common portions should be found during the compression process, which would make a noticeable difference in compression ratio. Of course, decompression will have to use the same reference.

It's a relatively simple test, and should give some information on how to move forward with more ideas.
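For illustration, here is roughly how that test could look with the library's stable prefix API; the inline JSON literals merely stand in for a complete older payload (the reference) and a newer payload to compress, and the compression level is arbitrary.

/* Rough sketch: compress a payload against a full previous payload used as a
 * reference, via ZSTD_CCtx_refPrefix / ZSTD_DCtx_refPrefix. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    /* Placeholders: in practice these would be a complete older JSON payload
     * and the new payload to compress. */
    const char ref[] = "{\"flags\":{\"FeatureA\":true,\"FeatureB\":false}}";
    const char src[] = "{\"flags\":{\"FeatureA\":true,\"FeatureB\":true}}";

    size_t dstCap = ZSTD_compressBound(sizeof(src));
    void* dst = malloc(dstCap);

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    /* The prefix is dropped after each frame, so it must be referenced again
     * before every compression call. */
    ZSTD_CCtx_refPrefix(cctx, ref, sizeof(ref));
    size_t csize = ZSTD_compress2(cctx, dst, dstCap, src, sizeof(src));
    if (ZSTD_isError(csize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(csize)); return 1; }

    /* Decompression must reference the exact same prefix. */
    char out[sizeof(src)];
    ZSTD_DCtx* dctx = ZSTD_createDCtx();
    ZSTD_DCtx_refPrefix(dctx, ref, sizeof(ref));
    size_t dsize = ZSTD_decompressDCtx(dctx, out, sizeof(out), dst, csize);
    if (ZSTD_isError(dsize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(dsize)); return 1; }

    printf("compressed %zu -> %zu bytes\n", sizeof(src), csize);
    ZSTD_freeCCtx(cctx);
    ZSTD_freeDCtx(dctx);
    free(dst);
    return 0;
}

By default a prefix is referenced as raw content (no entropy tables are built from it ahead of time), which matches the "one complete file as reference" idea, and it has to be referenced again before every frame on both the compression and decompression sides.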