facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

How to accelerate the process of dictionary training in zstd? #4053

Open riyuejiuzhao opened 1 month ago

riyuejiuzhao commented 1 month ago

I am running into a time-consuming problem with zstd dictionary training on large datasets. The slow process has led me to look for ways to speed it up.

I would be grateful for any suggestions, code examples, or guidance on how to accelerate dictionary training with the zstd library. I would like to use multithreading for training, but I am unsure how to implement it.

Thanks

Cyan4973 commented 1 month ago

-T0 will trigger multi-threading, scaling the number of worker threads to the number of detected cores on the local system. It's active during training.

Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# option, and the trainer will randomize its selection up to the requested amount.

There are several dictionary trainers available, and --train-fastcover is the fastest one. It's enabled by default, and it also features multiple advanced parameters, some of which can impact speed in major ways. Try --train-fastcover=accel=#, with # within [1,10]; it will trade accuracy for speed. Other advanced parameters exist, but they can be harder to understand and employ.
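
For illustration, the three flags above can be combined in a single invocation; the `samples/` directory and `trained.dict` output name here are hypothetical placeholders:

```
zstd --train-fastcover=accel=5 -T0 --memory=512MB samples/* -o trained.dict
```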

riyuejiuzhao commented 1 month ago

> -T0 will trigger multi-threading, scaling the number of worker threads to the number of detected cores on the local system. It's active during training.
>
> Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# option, and the trainer will randomize its selection up to the requested amount.
>
> There are several dictionary trainers available, and --train-fastcover is the fastest one. It's enabled by default, and it also features multiple advanced parameters, some of which can impact speed in major ways. Try --train-fastcover=accel=#, with # within [1,10]; it will trade accuracy for speed. Other advanced parameters exist, but they can be harder to understand and employ.

Thank you very much for your help! I am actually using the Python interface of zstd to train dictionaries. When I set the threads parameter, the training process entered optimization mode, which actually took even longer than regular training.
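
(For reference: in python-zstandard, leaving k and d at 0 asks train_dictionary to search for good parameters, which appears to be the "optimization mode" described above. A minimal sketch of avoiding that search, assuming the zstandard package and a hypothetical samples/ directory; the k, d, and accel values are illustrative, not tuned:)

```python
import pathlib
import zstandard

# Hypothetical sample directory: each file is one training sample.
samples = [p.read_bytes()
           for p in pathlib.Path("samples").iterdir() if p.is_file()]

# With k and d left at 0, train_dictionary searches for good values
# (the slow "optimization mode"). Passing explicit k and d should run
# a single fastcover pass; accel in [1, 10] trades accuracy for speed.
dictionary = zstandard.train_dictionary(
    112_640,   # target dictionary size in bytes (the zstd CLI default)
    samples,
    k=1048,    # segment size -- illustrative value
    d=8,       # dmer size -- zstd commonly uses 6 or 8
    accel=5,
)

with open("trained.dict", "wb") as f:
    f.write(dictionary.as_bytes())
```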

I think the main issue is that the dataset is too large overall. I am curious about the principles behind dictionary training. Is it possible to split the entire dataset into smaller parts, train them separately, and then combine the results?

Cyan4973 commented 1 month ago

> I think the main issue is that the dataset is too large overall. I am curious about the principles behind dictionary training. Is it possible to split the entire dataset into smaller parts, train them separately, and then combine the results?

Nope.

If your sample set is too large, your best option is to consider using --memory=# to limit the amount used for training.
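
(A library-side equivalent of --memory=# is to randomly subsample the corpus down to a byte budget before training. A hedged sketch, again assuming python-zstandard and a hypothetical samples/ directory; the 512 MiB budget is arbitrary:)

```python
import pathlib
import random
import zstandard

BUDGET = 512 * 1024 * 1024  # cap training input at 512 MiB, like --memory=512MB

paths = [p for p in pathlib.Path("samples").iterdir() if p.is_file()]
random.shuffle(paths)  # randomize the selection, as the CLI trainer does

picked, total = [], 0
for p in paths:
    size = p.stat().st_size
    if total + size > BUDGET:
        continue  # skip files that would exceed the budget, keep trying smaller ones
    picked.append(p.read_bytes())
    total += size

dictionary = zstandard.train_dictionary(112_640, picked, k=1048, d=8)
```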

riyuejiuzhao commented 1 month ago

> Training time is generally a function of training size. So if you want faster training, reduce the training sample size. If you don't want to do the selection work manually, use the --memory=# option, and the trainer will randomize its selection up to the requested amount.

Thank you. If I want to delve deeper into the specific principles of the dictionary training process, for example by debugging the libzstd source code with gdb, are there any resources or references you could recommend?

Cyan4973 commented 1 month ago

The source code itself points at a few resources, but beyond that, don't expect any tutorial to exist on the matter. These algorithms are fairly complex and rarely studied; there isn't an established corpus of CS knowledge around this topic yet.