Closed rohitagrawal10 closed 5 years ago
@jennifermliu will be working on a faster version of the COVER dictionary builder over the next few weeks that will be much easier to adapt to a streaming use case. As you say, we will have a hash map from the first d bytes to frequency, but the hash will not distinguish collisions. You can then start selecting segments and updating frequencies on the fly, but you'll have a hard time determining how often to select segments unless you know the stream size in advance.
However, I propose a simpler approach. Dictionary training realistically only needs so much data. You could use reservoir sampling to pick X random samples uniformly from the stream, and then build the dictionary when the stream is over. Pick X so that you end up with somewhere in the range of 1 MB to 100 MB of data; you can test what works best. This method is also independent of the dictionary construction algorithm.
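A minimal sketch of what that uniform reservoir sampling step (Algorithm R) could look like; all names here are illustrative and not part of zstd's API:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative reservoir sampler: keeps X samples chosen uniformly at
 * random from a stream of unknown length. */
typedef struct {
    unsigned char **samples;  /* copies of the kept samples */
    size_t *sizes;
    size_t capacity;          /* X: number of samples to keep */
    size_t seen;              /* total samples observed so far */
} reservoir_t;

static void reservoir_init(reservoir_t *r, size_t capacity) {
    r->samples = calloc(capacity, sizeof(*r->samples));
    r->sizes = calloc(capacity, sizeof(*r->sizes));
    r->capacity = capacity;
    r->seen = 0;
}

static void reservoir_add(reservoir_t *r, const void *data, size_t size) {
    size_t slot;
    if (r->seen < r->capacity) {
        slot = r->seen;                         /* fill phase: keep everything */
    } else {
        /* Keep the new sample with probability capacity / (seen + 1). */
        slot = (size_t)rand() % (r->seen + 1);
        if (slot >= r->capacity) { r->seen++; return; }
        free(r->samples[slot]);
    }
    r->samples[slot] = malloc(size);
    memcpy(r->samples[slot], data, size);
    r->sizes[slot] = size;
    r->seen++;
}
```

When the stream ends, the X retained buffers are a uniform sample of the whole stream and can be handed to any trainer.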
I don't fully understand the part where you say: "You can then start selecting segments and updating frequencies on the fly." Here is what I don't understand about things working on the fly:
Regarding your simpler approach, I don't have any issues with the amount of data/samples to pick to build the dictionary. We have done evaluations by varying X and finding out what works best. The online dictionary-building use case would fit well with our other APIs in the storage model, hence the need.
I just mean that you could select segments before having all the frequency information.
You can also make the same API work with the sampling approach. The API would look like:

```c
void sampler_init(struct sampler* sampler, size_t numSamples);
void sampler_add(struct sampler* sampler, void* data, size_t size);
size_t sampler_get(struct sampler* sampler, void** data, size_t** sizes); // returns num samples
void sampler_destroy(struct sampler* sampler);
```
You would add each sample to the sampler, and when you are ready to build a dictionary you would use sampler_get() to get the sampled data, then build the dictionary using the standard COVER algorithm (or any other algorithm). Would that work with your APIs / storage model?
Not really. The problem is that this is not truly online, and the sampler_get() step would cause a significant delay in the writers. This was also mentioned as a reason in #513.
I would want some of the work that COVER does, like frequency calculation, to happen in the sampler_add() step, so that dictionary building is less expensive by the time sampler_get() is called.
In my initial proposal, building the hash map etc. could be done online in the sampler_add() call as we keep receiving samples. What I'm really trying to understand is what to do in the next step of selecting segments, so that we could make it either completely or at least partially online. If we can figure that out, then we can allow users to make a series of calls like:
```c
ZSTD_dictInit()
ZSTD_addSampleToDict(sample1)
ZSTD_addSampleToDict(sample2)
ZSTD_addSampleToDict(sample3)
ZSTD_buildAndFinalizeDict()
```
I think we already have the first two calls figured out, and they can operate without the entire corpus of samples. The real question is whether the third call can be built in a way where it doesn't need access to the entire corpus of samples. If it can't, should zstd keep concatenating the samples it received in ZSTD_addSampleToDict() and then build the dictionary from the frequency maps it has built after all the ZSTD_addSampleToDict() calls?
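For the part that can clearly run online, a minimal sketch of accumulating hashed d-mer counts per sample, as could happen inside an "addSample"-style call. The names and the fixed-size collision-merging table are hypothetical, not the actual COVER code:

```c
#include <stddef.h>
#include <stdint.h>

#define DMER_D 8                    /* d: dmer length, as in COVER's d parameter */
#define FREQ_TABLE_SIZE (1u << 20)  /* fixed-size table; collisions merge counts */

static uint32_t freqs[FREQ_TABLE_SIZE];

/* FNV-1a hash of a d-byte window, reduced to a table index.
 * Colliding dmers share a counter, matching the approximation where the
 * hash does not distinguish collisions. */
static uint32_t hash_dmer(const unsigned char *p) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < DMER_D; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (uint32_t)(h & (FREQ_TABLE_SIZE - 1));
}

/* Update the frequency table with every d-byte window of one sample.
 * This needs only the current sample, so it can run as samples arrive. */
static void add_sample_freqs(const unsigned char *sample, size_t size) {
    if (size < DMER_D) return;
    for (size_t i = 0; i + DMER_D <= size; i++)
        freqs[hash_dmer(sample + i)]++;
}
```

This covers the frequency-map half of the problem; the open question in the thread is whether segment selection can similarly avoid needing the full corpus.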
Okay, I understand better now, thanks for the context! When you decide to build & finalize the dictionary, could you do that asynchronously while the writer is still using the old dictionary, and then switch over?
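A rough sketch of that rebuild-and-swap idea, assuming writers read the current dictionary through an atomic pointer. This is illustrative only (the `dict_t` contents and function names are made up), and safe reclamation of the old dictionary (refcounting/RCU) is deliberately omitted:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for a real dictionary; only tracks which build it came from. */
typedef struct { int generation; } dict_t;

static _Atomic(dict_t *) current_dict;

/* Builder thread: trains a replacement dictionary while writers keep
 * loading current_dict, then publishes it with a single atomic store. */
static void *build_new_dict(void *arg) {
    (void)arg;
    dict_t *old = atomic_load(&current_dict);
    dict_t *fresh = malloc(sizeof *fresh);   /* expensive training would go here */
    fresh->generation = old->generation + 1;
    atomic_store(&current_dict, fresh);      /* writers pick it up on next load */
    return old;                              /* caller must reclaim it safely */
}
```

Writers never block on training; they just keep using whatever `current_dict` points at until the swap lands.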
The new dictionary builder that @jennifermliu will be working on should be a good first step in this direction. We intend it to be a much faster version of the COVER dictionary builder that also uses less memory, in exchange for slightly less well tuned dictionaries. The first version won't have a streaming API, but it might be fast enough for you as is. After that, we can measure how long each step takes and consider a streaming version if computing the frequencies is a significant portion of the work.
I'll make sure to ping this issue when we have a candidate for the faster dictionary builder, so that you can test it out. I expect it to be ready in about a month.
Once I decide to build & finalize, I cannot switch back, and it can't be made asynchronous like you mentioned. I agree that the new dictionary builder is a good step in this direction; if it were not already being planned, I would have been happy to help build it. While building out the new dictionary builder, it would be super helpful if you could spend some time thinking about how subsequent build-and-finalize calls could be made online/streaming. Otherwise we can always experiment with the new dictionary builder and start a discussion from there.
Hi @terrelln, just following up. Is https://github.com/facebook/zstd/commit/9d6ed9def34fa042291d48637dabc695bfa5d23f the faster dictionary builder you were talking about? Also, does this affect compression/decompression throughput when using dictionaries, or only dictionary creation times?
The new fast dictionary builder by @jennifermliu is now merged into the dev branch, and has become the new "default" dictionary builder when using zstd --train.
Hi,
I've been working on a model where we can build a dictionary in a streaming fashion. This was asked about earlier in https://github.com/facebook/zstd/issues/513. To start, I've been trying to find the places where the code assumes it has the entire corpus of samples available. When building dictionaries in a streaming fashion we will need to work around these assumptions. Please note that I may be wrong about my interpretations of the code, so feel free to correct me.
I see 2 distinct phases:

1. Context Initialization
2. Building the dictionary with the chosen k value.
I am proposing the following changes for when we don't supply the entire corpus of samples to zstd but instead provide them sample by sample through an API (for the time being, let's call it ZSTD_addSampleToDict()):
1. Context Initialization
2. Dictionary Building. This step begins with chunking, i.e. creation of the epochs, and selecting the best segment. I'm not 100% sure what to do in this step and would love feedback on how we can proceed knowing that we don't have the entire corpus of samples available. I'm proposing a couple of suggestions/changes.
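For reference, the segment-selection step under discussion can be sketched as a sliding-window score over one epoch, where a segment's score is the sum of the frequencies of the dmers it contains. This is illustrative code, not zstd's actual implementation; `SEG_D`, `SEG_K`, and `pick_best_segment` are made-up names:

```c
#include <stddef.h>
#include <stdint.h>

#define SEG_D 8    /* d: dmer length */
#define SEG_K 32   /* k: segment length in bytes */

/* Greedy segment selection over one epoch.
 * freq_of_pos[i] holds the (hashed) frequency of the dmer starting at
 * byte i. The k-byte window with the highest total dmer frequency wins;
 * the sum is maintained incrementally as the window slides. */
static size_t pick_best_segment(const uint32_t *freq_of_pos, size_t epoch_size) {
    size_t ndmers = epoch_size - SEG_D + 1;
    size_t per_seg = SEG_K - SEG_D + 1;   /* dmers per k-byte segment */
    uint64_t score = 0, best;
    size_t best_pos = 0;
    for (size_t i = 0; i < per_seg; i++) score += freq_of_pos[i];
    best = score;
    for (size_t i = 1; i + per_seg <= ndmers; i++) {
        score += freq_of_pos[i + per_seg - 1];  /* dmer entering the window */
        score -= freq_of_pos[i - 1];            /* dmer leaving the window */
        if (score > best) { best = score; best_pos = i; }
    }
    return best_pos;   /* byte offset of the best segment in the epoch */
}
```

The sticking point raised above is that this step wants the final frequency table and a materialized epoch, which is exactly what a streaming flow does not have until the last sample arrives.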
Would love to hear some other suggestions to build this. I’m looking to submit a patch on this.
Rohit