facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Merging dictionaries. #1577

Closed smccauliff closed 5 years ago

smccauliff commented 5 years ago

We have data coming out of multiple Kafka partitions and would like to be able to build dictionaries in parallel, then merge each dictionary into a single dictionary. Sometimes the partitions don't exist in the same data center.

Is there a way to do this? Is it realistic to attempt an implementation of this feature?

Thanks, Sean

terrelln commented 5 years ago

Zstd dictionaries have a small header (< 500 bytes) and the rest is just content. Zstd tries to put the best content at the end of the dictionary. You could try training separate smaller dictionaries, and then just concatenating them together. As long as each individual dictionary is large enough, say 10 KB, then the result should be pretty good.
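
A minimal sketch of the concatenation idea, assuming each partition's samples have already been gathered into a flat buffer plus an array of sizes (the helper names and buffer layout are illustrative, not part of the zstd API):

```c
/* Sketch of "train per partition, then concatenate".
 * Only ZDICT_trainFromBuffer() is a real zstd call; the helpers
 * and buffer layout are illustrative. */
#include <stdio.h>
#include <string.h>
#include <zdict.h>

/* Train one dictionary (e.g. ~16 KB) for a single partition's samples. */
static size_t train_partition_dict(void* dictBuf, size_t dictCap,
                                   const void* samples, const size_t* sampleSizes,
                                   unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuf, dictCap,
                                                  samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 0;
    }
    return dictSize;
}

/* Concatenate per-partition dictionaries into one buffer.
 * Only the first dictionary's header is interpreted when loading;
 * the remaining bytes are simply treated as content. */
static size_t concat_dicts(void* dst, size_t dstCap,
                           void* const* dicts, const size_t* dictSizes,
                           unsigned nbDicts)
{
    size_t pos = 0;
    for (unsigned i = 0; i < nbDicts; i++) {
        if (pos + dictSizes[i] > dstCap) return 0;
        memcpy((char*)dst + pos, dicts[i], dictSizes[i]);
        pos += dictSizes[i];
    }
    return pos;
}
```

The concatenated buffer can then be loaded like any other dictionary, for example via ZSTD_createCDict() or ZSTD_compress_usingDict(); how well the result compresses is what would need measuring.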

The problem with that approach is that you might end up with extra redundancy in the dictionary. Another approach could be to generate a full dictionary for each region and make sure that you end up with >= 10x your target dictionary size, either by having >= 10 dictionaries or by making the first-stage dictionaries larger. Then train a dictionary on the dictionaries. I'm not sure that it would work, but it might be worth a try.
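
A minimal sketch of that second stage, feeding the first-stage dictionaries back into ZDICT_trainFromBuffer() as if they were ordinary samples (buffer names are illustrative):

```c
/* Sketch of "train a dictionary on the dictionaries": the first-stage
 * dictionaries become the samples of a second training pass. */
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

/* firstStageDicts / firstStageSizes: the per-region dictionaries.
 * Their total size should be well above the target dictionary size
 * (e.g. >= 10x) so the second pass has enough material. */
static size_t train_dict_of_dicts(void* finalDict, size_t finalDictCap,
                                  void* const* firstStageDicts,
                                  const size_t* firstStageSizes,
                                  unsigned nbDicts)
{
    /* ZDICT_trainFromBuffer expects all samples concatenated in one buffer. */
    size_t total = 0;
    for (unsigned i = 0; i < nbDicts; i++) total += firstStageSizes[i];

    char* samples = malloc(total);
    if (samples == NULL) return 0;

    size_t pos = 0;
    for (unsigned i = 0; i < nbDicts; i++) {
        memcpy(samples + pos, firstStageDicts[i], firstStageSizes[i]);
        pos += firstStageSizes[i];
    }

    size_t const dictSize = ZDICT_trainFromBuffer(finalDict, finalDictCap,
                                                  samples, firstStageSizes, nbDicts);
    free(samples);
    return ZDICT_isError(dictSize) ? 0 : dictSize;
}
```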

Another method could be to detect repeated segments and keep only the last instance of each repeated segment. You'll have to tune the segment size to make it work well, and it shouldn't be too small.
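
A minimal sketch of that de-duplication pass over a concatenated dictionary buffer (the segment size and the quadratic scan are purely illustrative):

```c
/* Sketch of "keep only the last instance of each repeated segment". */
#include <string.h>

#define SEGMENT_SIZE 64   /* illustrative tuning knob; shouldn't be too small */

/* Emits a segment only if no identical segment appears later, so the
 * copy that survives is the last occurrence. The tail
 * (srcSize % SEGMENT_SIZE bytes) is ignored in this sketch. */
static size_t dedup_segments(const char* src, size_t srcSize,
                             char* dst, size_t dstCap)
{
    size_t const nbSegs = srcSize / SEGMENT_SIZE;
    size_t outPos = 0;
    for (size_t i = 0; i < nbSegs; i++) {
        int repeatedLater = 0;
        for (size_t j = i + 1; j < nbSegs; j++) {
            if (memcmp(src + i * SEGMENT_SIZE,
                       src + j * SEGMENT_SIZE, SEGMENT_SIZE) == 0) {
                repeatedLater = 1;
                break;
            }
        }
        if (!repeatedLater) {
            if (outPos + SEGMENT_SIZE > dstCap) return 0;
            memcpy(dst + outPos, src + i * SEGMENT_SIZE, SEGMENT_SIZE);
            outPos += SEGMENT_SIZE;
        }
    }
    return outPos;
}
```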

If you find that one of these approaches works well, or you take a different angle, please update me so I can think about adding better support natively!

smccauliff commented 5 years ago

Is the dictionary format documented or can you point me at the source code that defines the dictionary?

Cyan4973 commented 5 years ago

I would be cautious about this approach.

My own gut feeling is that it might be possible to create a dictionary from multiple separate sources by merging the internal state of the dictionary builders, which contains a lot more information than the dictionaries themselves and would therefore prove a better starting point for the merge.

However, the dictionary builder has not been designed to export its state, let alone to merge states. This would be a new feature, and definitely a high-investment one.

An alternative approach is to randomly sample each source: a dictionary is supposed to capture the most common redundancies, and if they are common, it should not be difficult to find them. There's no need to send the whole partition for analysis; just grab a few thousand random samples and use them as the starting point for the dictionary builder. (Note: random is important here. Consecutive samples are not recommended, as they may feature data locality which would not be generic, such as similar timestamps.) This represents a small amount of data, suitable for transport across multiple data centers.
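
A minimal sketch of that sampling step, using reservoir sampling so each partition keeps a uniform random subset without knowing its total size up front (the Record type and the reservoir helper are placeholders for whatever the Kafka consumer yields, not zstd API):

```c
/* Sketch of per-partition random sampling via reservoir sampling.
 * The reservoirs, not the full partitions, are shipped to one place
 * and then fed to ZDICT_trainFromBuffer(). */
#include <stdlib.h>

#define RESERVOIR_SIZE 4096   /* "a few thousand" random samples */

typedef struct { const void* data; size_t size; } Record;   /* placeholder */

typedef struct {
    Record samples[RESERVOIR_SIZE];
    unsigned long long seen;   /* records observed so far */
} Reservoir;

/* Classic reservoir sampling (Algorithm R): every record has equal
 * probability of ending up in the reservoir, so consecutive records
 * with shared timestamps / locality do not dominate the sample set.
 * rand() is used only for illustration; use a better RNG in practice. */
static void reservoir_offer(Reservoir* r, Record rec)
{
    if (r->seen < RESERVOIR_SIZE) {
        r->samples[r->seen] = rec;
    } else {
        unsigned long long j = (unsigned long long)rand() % (r->seen + 1);
        if (j < RESERVOIR_SIZE) r->samples[j] = rec;
    }
    r->seen++;
}
```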

Another aspect is that it's unclear what the gain or purpose would be. Why build a common dictionary for data sets which are apparently separated in their respective data centers? From a compression ratio perspective, it seems it would be more efficient to specialize each dictionary for its own local source.

terrelln commented 5 years ago

https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format

smccauliff commented 5 years ago

There are some cases where a general dictionary created from different data sources might be a better solution. For example, when a new data source becomes available, one that has not been seen before, which dictionary should it use? One possibility is to have some kind of fallback dictionary that has been generated from different sources. Scaling the creation of this dictionary would work better if multiple data sources could be consumed in parallel and the resulting dictionaries merged into a final dictionary. Another use case is that some applications need to consume from many different sources; these applications may have scale issues with holding multiple dictionaries. There is also the issue of scaling dictionary management.

I agree random sampling of data sources is a good idea. Temporal correlation is an issue not just for timestamps; different kinds of user activity will probably be correlated with different times of day.

This is more of an exploratory question, and I appreciate all of your feedback.

shakeelrao commented 5 years ago

@terrelln @Cyan4973 Slightly tangential, but are dictionaries being used in production at FB? If so, do you have any stats on how often a dictionary is recomputed (i.e. is there a notion of a stale dictionary)?

Cyan4973 commented 5 years ago

@shakeelrao , you might be interested in this article, where we detail our usage of Dictionary Compression and associated solution architecture at Facebook.

shakeelrao commented 5 years ago

Thanks, @Cyan4973! This article looks great.

terrelln commented 5 years ago

The performance of dictionaries degrades over time. How quickly they degrade depends on the type of data. For example, we tested training dictionaries on web content and found that dictionaries degrade around 5-10% year over year. Other types of data degrade 10% in a month.

For example, a dictionary for web data may contain a common license file, like BSD, since it is included in many JS libraries. But if in 2020 a new type of license becomes popular and BSD falls out of favor, then a dictionary created now will not be as effective as a dictionary created in 2020.

Cyan4973 commented 5 years ago

Closing. Merging multiple dictionaries into one does not seem like a practical approach.