facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

[Help Wanted, Questions] Improving Dictionary Training Process #4127

Open neiljohari opened 3 months ago

neiljohari commented 3 months ago

Hi team,

We are integrating zstd + shared compression dictionaries at Roblox for serving feature flag payloads! We think this is a good use case because the payload looks similar over time (people add new flags, flip them, or delete them, but week-over-week the data is very similar), and we control the client, so we can ship it with the dictionary.

I've been playing around with various training parameters and set up a few harnesses to try out different parameter combinations (and found previous insight here), and I've found a few approaches that seem to work well. However, it still feels a bit like blind guessing, and we aren't sure whether this is the best we can do, so we were wondering if the maintainers/community have insight into how we can improve our process.

Our payloads are currently ~435 KB of JSON and differ by client type, though the majority of flags are the same across clients. Examples:

We currently artificially expand our payload into a bunch of training files:

We validate the compression ratio the dictionary achieves on historical payloads to see its effectiveness over time.
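Concretely, a single validation run looks roughly like the sketch below; the dictionary path, the historical payload path, and the compression level are placeholders rather than our exact setup.

/* Rough sketch of one validation run: compress one historical payload with
 * the trained dictionary and print the ratio. Paths and level are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static void* read_file(const char* path, size_t* size)
{
    FILE* f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    fseek(f, 0, SEEK_SET);
    void* buf = malloc(*size);
    if (fread(buf, 1, *size, f) != *size) { perror(path); exit(1); }
    fclose(f);
    return buf;
}

int main(void)
{
    size_t dictSize, srcSize;
    void* dict = read_file("flags.dict", &dictSize);                 /* trained dictionary (placeholder path) */
    void* src  = read_file("2024-08-13/AndroidApp.json", &srcSize);  /* historical payload (placeholder path) */

    size_t dstCap = ZSTD_compressBound(srcSize);
    void* dst = malloc(dstCap);

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t csize = ZSTD_compress_usingDict(cctx, dst, dstCap, src, srcSize,
                                           dict, dictSize, 19 /* level, arbitrary */);
    if (ZSTD_isError(csize)) {
        fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(csize));
        return 1;
    }
    printf("%zu -> %zu bytes, ratio %.2f\n", srcSize, csize,
           (double)srcSize / (double)csize);

    ZSTD_freeCCtx(cctx);
    free(dst); free(src); free(dict);
    return 0;
}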

Example of our training dir structure:

├── 2024-08-13
│   ├── AndroidApp.json_split_2048
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_1
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_2
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_4096
│   │   ├── AndroidApp.json_chunk_4096_000000.json
│   │   ├── AndroidApp.json_chunk_4096_000001.json
│   │   ├── AndroidApp.json_chunk_4096_000002.json
│   │   ├── AndroidApp.json_chunk_4096_000003.json
│   │   ├── AndroidApp.json_chunk_4096_000004.json
│   ├── AndroidApp.json_split_4096_copy_1
│       ├── AndroidApp.json_chunk_4096_000000.json
│       ├── AndroidApp.json_chunk_4096_000001.json
│       ├── AndroidApp.json_chunk_4096_000002.json
│       ├── AndroidApp.json_chunk_4096_000003.json
│       ├── AndroidApp.json_chunk_4096_000004.json
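For completeness, the training step over those chunk files boils down to the stable ZDICT API, whether it is driven through the CLI or the library directly. Below is a rough C sketch of that call; the 110 KB target dictionary size and the flags.dict output path are arbitrary placeholders, not our actual configuration.

/* Rough sketch: build a dictionary from the chunk files passed as arguments. */
#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

int main(int argc, char** argv)
{
    if (argc < 2) { fprintf(stderr, "usage: train <chunk files...>\n"); return 1; }

    size_t total = 0;
    size_t* sampleSizes = malloc((size_t)(argc - 1) * sizeof(size_t));

    /* First pass: record each sample's size and the total. */
    for (int i = 1; i < argc; i++) {
        FILE* f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); return 1; }
        fseek(f, 0, SEEK_END);
        sampleSizes[i - 1] = (size_t)ftell(f);
        total += sampleSizes[i - 1];
        fclose(f);
    }

    /* Second pass: concatenate all samples back-to-back into one buffer. */
    char* samples = malloc(total);
    size_t off = 0;
    for (int i = 1; i < argc; i++) {
        FILE* f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); return 1; }
        if (fread(samples + off, 1, sampleSizes[i - 1], f) != sampleSizes[i - 1]) {
            perror(argv[i]); return 1;
        }
        off += sampleSizes[i - 1];
        fclose(f);
    }

    size_t dictCap = 112640;   /* 110 KB target size, an arbitrary choice here */
    void* dict = malloc(dictCap);
    size_t dictSize = ZDICT_trainFromBuffer(dict, dictCap,
                                            samples, sampleSizes,
                                            (unsigned)(argc - 1));
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }

    FILE* out = fopen("flags.dict", "wb");   /* placeholder output path */
    fwrite(dict, 1, dictSize, out);
    fclose(out);
    printf("dictionary size: %zu bytes\n", dictSize);
    return 0;
}

ZDICT_trainFromBuffer() expects all samples concatenated in one buffer alongside a parallel array of their sizes, which is what the two passes above build.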

Some things we've noticed that we don't quite understand:

Thanks in advance for your time! We're really excited to roll this out soon and would love your insights on how we can do even better.

neiljohari commented 2 months ago

Hey @Cyan4973!

Apologies for the direct ping; just wondering if you had any insight on our questions and whether our process is sensible, or if you see room for improvement?

Thanks!

Cyan4973 commented 2 months ago

Hi @neiljohari ,

It's a complex use case; unfortunately, it will require more than one sentence to answer.

Our payloads currently are ~435kb JSON

That's a starting point. In general, 435 KB should be large enough to compress well without a dictionary. A dictionary can still help a little, but generally it's not worth the cost.

That being said, this statement only holds in general. Your use case might well be more specific, in which case the statement above wouldn't apply, and this becomes somewhat uncharted territory (or at least something we don't encounter regularly on our side).

It would be relatively simple to manufacture a counter-example: imagine a file which is hard to compress (say it consists of encrypted segments), so that on its own it may be 500 KB yet compress very poorly. Then you might have multiple files of the same kind, meaning they consist of roughly the same encrypted segments, repeated across files but never within a file. In that case, assuming the quantity of segments is tractable, a dictionary will work wonders, and there will be a big compression difference between using a dictionary or not, even though the original file is "large" and therefore presumed not to benefit from one.

While the scenario described above is a bit of a trope, it wouldn't be too hard to end up in a somewhat related one. For example, we could imagine a very large template, with a lot of static content essentially repeated across documents (but not necessarily within a document), and just a few short answers differing (and even then, often similar).

And now, I wonder if that's what's happening for your use case.

One of the problems of the dictionary builder is that this is not the use case it's designed for. It rather expects small segments present more or less randomly across many small inputs, which is very different from large sections being identical across large inputs. So the dictionary builder will make the mistake of cutting these large sections into small segments, introducing a lot of random noise (and losing a fair amount of efficiency) in the process.

Hence, the "ideal" dictionary builder for this use case doesn't exist, or at least is not (yet) provided by this repository.

One suggestion could be to use one complete JSON file as a point of reference, which would become the dictionary. Assuming most other files are highly related to this one, the common portions should be found during the compression process, which would make a noticeable difference in compression ratio. Of course, decompression will have to use the same reference.

It's a relatively simple test, and should give some information on how to move forward with more ideas.
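For illustration, here is roughly how that test could look with the library's stable prefix API; the inline JSON literals merely stand in for a complete older payload (the reference) and a newer payload to compress, and the compression level is arbitrary.

/* Rough sketch: compress a payload against a full previous payload used as a
 * reference, via ZSTD_CCtx_refPrefix / ZSTD_DCtx_refPrefix. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    /* Placeholders: in practice these would be a complete older JSON payload
     * and the new payload to compress. */
    const char ref[] = "{\"flags\":{\"FeatureA\":true,\"FeatureB\":false}}";
    const char src[] = "{\"flags\":{\"FeatureA\":true,\"FeatureB\":true}}";

    size_t dstCap = ZSTD_compressBound(sizeof(src));
    void* dst = malloc(dstCap);

    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    /* The prefix is dropped after each frame, so it must be referenced again
     * before every compression call. */
    ZSTD_CCtx_refPrefix(cctx, ref, sizeof(ref));
    size_t csize = ZSTD_compress2(cctx, dst, dstCap, src, sizeof(src));
    if (ZSTD_isError(csize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(csize)); return 1; }

    /* Decompression must reference the exact same prefix. */
    char out[sizeof(src)];
    ZSTD_DCtx* dctx = ZSTD_createDCtx();
    ZSTD_DCtx_refPrefix(dctx, ref, sizeof(ref));
    size_t dsize = ZSTD_decompressDCtx(dctx, out, sizeof(out), dst, csize);
    if (ZSTD_isError(dsize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(dsize)); return 1; }

    printf("compressed %zu -> %zu bytes\n", sizeof(src), csize);
    ZSTD_freeCCtx(cctx);
    ZSTD_freeDCtx(dctx);
    free(dst);
    return 0;
}

By default a prefix is referenced as raw content (no entropy tables are built from it ahead of time), which matches the "one complete file as reference" idea, and it has to be referenced again before every frame on both the compression and decompression sides.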