facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

real-time dictionary function #3610

Open ppsnumi opened 1 year ago

ppsnumi commented 1 year ago

Is your feature request related to a problem? Please describe. I would like zstd's dictionary creation/analysis to work in real time. The dictionary feature currently operates only by analyzing pre-existing data, which is not suitable for services that use zstd in real time.

Describe the solution you'd like It would be nice if a dictionary that is managed in real time could be added as another option in zstd. However, since many zstd processes could read/write the dictionary file at the same time, concurrency control seems very important.

Describe alternatives you've considered If a real-time dictionary is not supported, I will have to analyze the previous day's data and apply the resulting dictionary file to today's real-time data.

Additional context A real-time dictionary would neither work as well as the current dictionary feature nor be as desirable, but it would be a great fit for the fast compression/decompression that zstd aims for. Even if, at first, it only worked by continuously appending new data to the dictionary file, and so were less efficient than the existing dictionary format, that would also lower the difficulty of implementing the feature.

Thank you.

Cyan4973 commented 1 year ago

I don't understand what you mean by "real-time dictionary function".

The closest thing I can think of that could correspond to this definition is the streaming mode, where the sender and receiver synchronize the same state and continuously update that state, so that new data compresses better thanks to previously sent data.

Streaming mode exists, and is already well supported by libzstd (ZSTD_compressStream()).
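
For illustration, a minimal sketch of that streaming pattern (error handling and the actual transport are abbreviated; the messages and buffer sizes are placeholders):

```c
#include <string.h>
#include <zstd.h>   /* ZSTD_createCStream, ZSTD_compressStream, ZSTD_flushStream */

/* Minimal sketch: both messages go through one ZSTD_CStream, so the
 * second message can reference bytes of the first. The receiver keeps
 * a matching ZSTD_DStream and sees the same history. */
void stream_sketch(void)
{
    ZSTD_CStream* zcs = ZSTD_createCStream();
    ZSTD_initCStream(zcs, 3);                   /* compression level 3 */

    const char* msgs[] = { "<user><id>1</id></user>",
                           "<user><id>2</id></user>" };
    char out[1024];

    for (int i = 0; i < 2; i++) {
        ZSTD_inBuffer in = { msgs[i], strlen(msgs[i]), 0 };
        while (in.pos < in.size) {
            ZSTD_outBuffer ob = { out, sizeof out, 0 };
            ZSTD_compressStream(zcs, &ob, &in);
            /* send out[0 .. ob.pos) to the receiver here */
        }
        /* Flush so the receiver can decode this message immediately,
         * without closing the frame (history is preserved). */
        ZSTD_outBuffer ob = { out, sizeof out, 0 };
        ZSTD_flushStream(zcs, &ob);
        /* send out[0 .. ob.pos) */
    }
    ZSTD_freeCStream(zcs);
}
```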

ppsnumi commented 1 year ago

Sorry.

Let me explain in a little more detail. In my service environment, multiple users upload and download xml files. Because of its tag structure, an xml file contains a lot of repetitive content. I want to compress and decompress these xml files in real time. zstd's dictionary feature is driven by "zstd --train": to use it, you must first run an analysis over already-existing files of the kind you want to compress.

In an environment where xml files are uploaded and downloaded on user request in real time, a method based on analyzing pre-existing files does not seem suitable to me.

However, in my environment, using the dictionary feature achieves a very large gain in compression efficiency:

with dictionary: 22K xml -> 2K xml
without dictionary: 22K xml -> 9K xml

The "real-time dictionary function" I'm talking about is that "--train" is performed simultaneously with zstd's compression operation to create/increase the dictionary file.

In a situation where many users continuously upload xml files, I want to increase compression efficiency by building the dictionary file, i.e. running "--train" on the contents of each xml file at the same time it is compressed.

Analyzing each xml file individually as it is compressed may be less efficient than analyzing many xml files in one batch later, but even if the resulting dictionary file is somewhat less efficient, it still seems like it would be a nice feature.

The "real-time dictionary function" is after all When compressing, "--train" and "-D" operate at the same time, This function continuously updates the contents of the dictionary.

Thank you.

felixhandte commented 1 year ago

To answer your immediate question: this functionality does not exist.

Here's why: in order to decompress an object compressed with a dictionary, you need to present exactly the same dictionary that was used to compress that object*. In the scenario you're describing, where you are continuously updating the dictionary, you would need to store each and every changed version of the dictionary, so that you could decompress the individual record that used that particular version of the dictionary. This would be overwhelmingly expensive compared to the benefits--dictionaries are only worthwhile when you can amortize their cost over many compressions. This makes dictionary-based compression a fundamentally batched operation.
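
As a concrete illustration of this constraint: a zstd frame can record the ID of the dictionary it was compressed with, so a decompressor must locate the one matching dictionary. A small sketch of that check:

```c
#include <zstd.h>

/* Sketch: a frame records the ID of the dictionary used to compress
 * it, so a candidate dictionary can be verified before decompressing.
 * With a continuously-mutating dictionary, every version would get its
 * own ID and would have to be retained indefinitely. */
int dict_matches(const void* frame, size_t frameSize,
                 const void* dict, size_t dictSize)
{
    unsigned wanted = ZSTD_getDictID_fromFrame(frame, frameSize);
    unsigned have   = ZSTD_getDictID_fromDict(dict, dictSize);
    return wanted != 0 && wanted == have;   /* 0 = no dict ID recorded */
}
```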

Unless your data is extremely temporally-correlated, I would expect less-than-perfectly up-to-date dictionaries to perform nearly as well. Perhaps you could do a training run once a day on all the samples uploaded in the previous day, and use that for the next day's compressions. (Or even once a week. That's what we do at Meta.)
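
To illustrate the amortization point, here is a sketch of reusing one day's dictionary across many compressions by digesting it once into a ZSTD_CDict (the dictionary buffer is assumed to have been loaded from yesterday's training run):

```c
#include <string.h>
#include <zstd.h>

/* Sketch of amortizing dictionary cost: digest yesterday's dictionary
 * once into a ZSTD_CDict, then reuse it for every compression today.
 * dictBuf/dictLen stand in for a dictionary file loaded from disk. */
size_t compress_all(const void* dictBuf, size_t dictLen,
                    const char** docs, size_t nbDocs,
                    char* out, size_t outCap)
{
    ZSTD_CDict* cdict = ZSTD_createCDict(dictBuf, dictLen, 3);
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    size_t total = 0;

    for (size_t i = 0; i < nbDocs; i++) {
        size_t csize = ZSTD_compress_usingCDict(cctx, out, outCap,
                                                docs[i], strlen(docs[i]),
                                                cdict);
        if (ZSTD_isError(csize)) break;
        total += csize;   /* in real code, write out[0..csize) somewhere */
    }
    ZSTD_freeCCtx(cctx);
    ZSTD_freeCDict(cdict);
    return total;
}
```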

* In theory... this could be avoided if you made the dictionary a prepend-only buffer (while preserving the header), but none of the training tools are set up to support that. And again, I don't think that would really buy you much.

ktsaou commented 1 year ago

Sorry guys. I deleted my message. I responded to the wrong thread.