Closed rohitagrawal10 closed 5 years ago
@jennifermliu will be working on a faster version of the COVER dictionary builder over the next few weeks that will be much easier to adapt to a streaming use case. As you say, we will have a hash map from the first d bytes to frequency, but the hash will not distinguish collisions. You can then start selecting segments and updating frequencies on the fly, but you'll have a hard time determining how often to select segments unless you know the stream size in advance.
However, I propose a simpler approach. Dictionary training realistically only needs so much data. You could use reservoir sampling to pick X random samples uniformly from the stream, and then build the dictionary when the stream is over. Pick X so that you end up with somewhere in the range of 1 MB to 100 MB of data; you can test what works best. This method is also independent of the dictionary construction algorithm.
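A minimal sketch of what that uniform reservoir sampling step (Algorithm R) could look like; all names here are illustrative and not part of zstd's API:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative reservoir sampler: keeps X samples chosen uniformly at
 * random from a stream of unknown length. */
typedef struct {
    unsigned char **samples;  /* copies of the kept samples */
    size_t *sizes;
    size_t capacity;          /* X: number of samples to keep */
    size_t seen;              /* total samples observed so far */
} reservoir_t;

static void reservoir_init(reservoir_t *r, size_t capacity) {
    r->samples = calloc(capacity, sizeof(*r->samples));
    r->sizes = calloc(capacity, sizeof(*r->sizes));
    r->capacity = capacity;
    r->seen = 0;
}

static void reservoir_add(reservoir_t *r, const void *data, size_t size) {
    size_t slot;
    if (r->seen < r->capacity) {
        slot = r->seen;                         /* fill phase: keep everything */
    } else {
        /* Keep the new sample with probability capacity / (seen + 1). */
        slot = (size_t)rand() % (r->seen + 1);
        if (slot >= r->capacity) { r->seen++; return; }
        free(r->samples[slot]);
    }
    r->samples[slot] = malloc(size);
    memcpy(r->samples[slot], data, size);
    r->sizes[slot] = size;
    r->seen++;
}
```

When the stream ends, the X retained buffers are a uniform sample of the whole stream and can be handed to any trainer.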
I don't fully understand the part where you say: "You can then start selecting segments and updating frequencies on the fly." Here is what I don't understand about things working on the fly:
Regarding your simpler approach, I don't have any issues with the amount of data/samples to pick to build the dictionary. We have done evaluations by varying X and finding out what works best. The online dictionary-building use case would fit well with our other APIs in the storage model, hence the need.
I just mean that you could select segments before having all the frequency information.
You can also make the same API work with the sampling approach. The API would look like:

```c
void sampler_init(struct sampler* sampler, size_t numSamples);
void sampler_add(struct sampler* sampler, void* data, size_t size);
size_t sampler_get(struct sampler* sampler, void** data, size_t** sizes); // returns num samples
void sampler_destroy(struct sampler* sampler);
```
You would add each sample to the sampler, and when you are ready to build a dictionary you would use sampler_get() to get the sampled data, then build the dictionary using the standard COVER algorithm (or any other algorithm). Would that work with your APIs / storage model?
Not really. The problem is that this is not truly online, and the sampler_get() step would cause a significant delay in the writers. This was also mentioned as a reason in #513.
I would want some of the work that COVER does, like frequency calculation, to happen in the sampler_add() step, so that dictionary building is less expensive by the time sampler_get() is called.
In my initial proposal, building the hash map etc. could be done online in the sampler_add() call as we keep receiving samples. What I'm really trying to understand is what to do in the next step of selecting segments, so that we could make it either completely or at least partially online. If we can figure that out, then we can allow users to make a series of calls like:
```c
ZSTD_dictInit()
ZSTD_addSampleToDict(sample1)
ZSTD_addSampleToDict(sample2)
ZSTD_addSampleToDict(sample3)
ZSTD_buildAndFinalizeDict()
```
I think we already have the first two calls figured out, and they can operate without the entire corpus of samples. The real question is whether the third call can be built in a way where it doesn't need access to the entire corpus of samples. If it can't, should zstd keep concatenating the samples it received in ZSTD_addSampleToDict() and then build the dictionary from the frequency maps it has built after all the ZSTD_addSampleToDict() calls?
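For the part that can clearly run online, a minimal sketch of accumulating hashed d-mer counts per sample, as could happen inside an "addSample"-style call. The names and the fixed-size collision-merging table are hypothetical, not the actual COVER code:

```c
#include <stddef.h>
#include <stdint.h>

#define DMER_D 8                    /* d: dmer length, as in COVER's d parameter */
#define FREQ_TABLE_SIZE (1u << 20)  /* fixed-size table; collisions merge counts */

static uint32_t freqs[FREQ_TABLE_SIZE];

/* FNV-1a hash of a d-byte window, reduced to a table index.
 * Colliding dmers share a counter, matching the approximation where the
 * hash does not distinguish collisions. */
static uint32_t hash_dmer(const unsigned char *p) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < DMER_D; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (uint32_t)(h & (FREQ_TABLE_SIZE - 1));
}

/* Update the frequency table with every d-byte window of one sample.
 * This needs only the current sample, so it can run as samples arrive. */
static void add_sample_freqs(const unsigned char *sample, size_t size) {
    if (size < DMER_D) return;
    for (size_t i = 0; i + DMER_D <= size; i++)
        freqs[hash_dmer(sample + i)]++;
}
```

This covers the frequency-map half of the problem; the open question in the thread is whether segment selection can similarly avoid needing the full corpus.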
Okay, I understand better now, thanks for the context! When you decide to build & finalize the dictionary, could you do that asynchronously while the writer is still using the old dictionary, and then switch over?
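A rough sketch of that rebuild-and-swap idea, assuming writers read the current dictionary through an atomic pointer. This is illustrative only (the `dict_t` contents and function names are made up), and safe reclamation of the old dictionary (refcounting/RCU) is deliberately omitted:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for a real dictionary; only tracks which build it came from. */
typedef struct { int generation; } dict_t;

static _Atomic(dict_t *) current_dict;

/* Builder thread: trains a replacement dictionary while writers keep
 * loading current_dict, then publishes it with a single atomic store. */
static void *build_new_dict(void *arg) {
    (void)arg;
    dict_t *old = atomic_load(&current_dict);
    dict_t *fresh = malloc(sizeof *fresh);   /* expensive training would go here */
    fresh->generation = old->generation + 1;
    atomic_store(&current_dict, fresh);      /* writers pick it up on next load */
    return old;                              /* caller must reclaim it safely */
}
```

Writers never block on training; they just keep using whatever `current_dict` points at until the swap lands.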
The new dictionary builder that @jennifermliu will be working on should be a good first step in this direction. We intend it to be a much faster version of the COVER dictionary builder that also uses less memory, in exchange for slightly less well tuned dictionaries. The first version won't have a streaming API, but it might be fast enough for you as is. After that, we can measure how long each step takes and consider a streaming version if computing the frequencies is a significant portion of the work.
I'll make sure to ping this issue when we have a candidate for the faster dictionary builder, so that you can test it out. I expect it to be ready in about a month.
Once I decide to build & finalize, I cannot switch back, and it can't be made asynchronous like you mentioned. I agree that the new dictionary builder is a good step in this direction; if it were not already being planned, I would have been happy to help build it. While building out the new dictionary builder, it would be super helpful if you could spend some time thinking about how subsequent build-and-finalize calls could be made online/streaming. Otherwise we can always experiment with the new dictionary builder and start a discussion from there.
Hi @terrelln, just following up. Is https://github.com/facebook/zstd/commit/9d6ed9def34fa042291d48637dabc695bfa5d23f the faster dictionary builder you were talking about? Also, does this affect compression/decompression throughput when using dictionaries, or only dictionary creation times?
The new fast dictionary builder by @jennifermliu is now merged into the dev branch, and has become the new "default" dictionary builder when using zstd --train.
Hi,
I've been working on a model where we can build a dictionary in a streaming fashion. This was asked about earlier in https://github.com/facebook/zstd/issues/513. To start, I've been trying to find the places where the code assumes it has the entire corpus of samples available. When building dictionaries in a streaming fashion we will need to work around these assumptions. Please note that I may be wrong about my interpretations of the code, so feel free to correct me.
I see 2 distinct phases:

1. Context Initialization
2. Building the dictionary with the chosen k value.
I am proposing the following changes for when we don't supply the entire corpus of samples to zstd but instead provide them sample by sample through an API (for the time being, let's call it ZSTD_addSampleToDict()):
1. Context Initialization
2. Dictionary Building. This step begins with chunking, i.e. creation of the epochs, and selecting the best segment. I'm not 100% sure what to do in this step and would love feedback on how we can proceed knowing that we don't have the entire corpus of samples available. I'm proposing a couple of suggestions/changes.
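For reference, the segment-selection step under discussion can be sketched as a sliding-window score over one epoch, where a segment's score is the sum of the frequencies of the dmers it contains. This is illustrative code, not zstd's actual implementation; `SEG_D`, `SEG_K`, and `pick_best_segment` are made-up names:

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_D 8    /* d: dmer length */
#define SEG_K 32   /* k: segment length in bytes */

/* Greedy segment selection over one epoch.
 * freq_of_pos[i] holds the (hashed) frequency of the dmer starting at
 * byte i. The k-byte window with the highest total dmer frequency wins;
 * the sum is maintained incrementally as the window slides. */
static size_t pick_best_segment(const uint32_t *freq_of_pos, size_t epoch_size) {
    size_t ndmers = epoch_size - SEG_D + 1;
    size_t per_seg = SEG_K - SEG_D + 1;   /* dmers per k-byte segment */
    uint64_t score = 0, best;
    size_t best_pos = 0;
    for (size_t i = 0; i < per_seg; i++) score += freq_of_pos[i];
    best = score;
    for (size_t i = 1; i + per_seg <= ndmers; i++) {
        score += freq_of_pos[i + per_seg - 1];  /* dmer entering the window */
        score -= freq_of_pos[i - 1];            /* dmer leaving the window */
        if (score > best) { best = score; best_pos = i; }
    }
    return best_pos;   /* byte offset of the best segment in the epoch */
}
```

The sticking point raised above is that this step wants the final frequency table and a materialized epoch, which is exactly what a streaming flow does not have until the last sample arrives.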
Would love to hear some other suggestions to build this. I’m looking to submit a patch on this.
Rohit