facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Dictionary training performance anomaly, any answers? #1572

Closed xinglin closed 5 years ago

xinglin commented 5 years ago

Hi,

zstd version: git commit@0dd3588accb239365a530dedbcfa649a5ecbda57.

I ran into a performance issue which does not seem easy to explain. I will post it here, hoping someone can answer it. I used ZDICT_trainFromBuffer() and found that training took quite some time. Then I tried ZDICT_optimizeTrainFromBuffer_fastCover() directly, because it lets me set the compression level rather than using the default. Ideally, the two shouldn't show a huge performance difference, since ZDICT_trainFromBuffer() is simply a thin wrapper around ZDICT_optimizeTrainFromBuffer_fastCover():

// from lib/dictBuilder/zdict.c
size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                             const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples)
{
    ZDICT_fastCover_params_t params;
    DEBUGLOG(3, "ZDICT_trainFromBuffer");
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    /* Default to level 6 since no compression level information is available */
    params.zParams.compressionLevel = 3;
#if defined(DEBUGLEVEL) && (DEBUGLEVEL>=1)
    params.zParams.notificationLevel = DEBUGLEVEL;
#endif
    return ZDICT_optimizeTrainFromBuffer_fastCover(dictBuffer, dictBufferCapacity,
                                               samplesBuffer, samplesSizes, nbSamples,
                                               &params);
}

I largely copied the lines from the function above into my own function. Here is what I changed in my code: I either commented out the first line or the second line, and the rest was kept the same.

            //dictSize = ZDICT_optimizeTrainFromBuffer_fastCover(dictBuff, maxDictSize, inbuffer, sampleSizes, nbSamples, &params);
            dictSize = ZDICT_trainFromBuffer(dictBuff, maxDictSize, inbuffer, sampleSizes, nbSamples);
            //printf("dictSize=%lld\n", dictSize);
            cdict = ZSTD_createCDict(dictBuff, dictSize, cLevel);

I used 100 8K blocks as the sample set for training the dictionary.

With ZDICT_trainFromBuffer()

        compressed size=6627597125, compress ratio=5.18 X
        train time=102.11 secs, compress time=125.51 secs, compress throughput=143.96 MB/s
        Average train time=49.860739 msecs, average compress time=0.029924 msecs
        Number of trains=2048, Number of blocks=4194304

With ZDICT_optimizeTrainFromBuffer_fastCover()

        compressed size=5534618471, compress ratio=6.21 X
        train time=31.46 secs, compress time=109.68 secs, compress throughput=232.17 MB/s
        Average train time=15.360809 msecs, average compress time=0.026149 msecs
        Number of trains=2048, Number of blocks=4194304

My question is: why is the second case 3X faster in training? I really don't understand this.

Cyan4973 commented 5 years ago

Also :

So that's a lot of improvements...

terrelln commented 5 years ago

What parameters did you use for ZDICT_optimizeTrainFromBuffer_fastCover()?

xinglin commented 5 years ago

@terrelln they are largely the same as what is used in ZDICT_trainFromBuffer(). I set the compression level to the one I will use in compression, rather than the default level (3). The compression level used did not seem to have an impact.

    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    params.zParams.compressionLevel = cLevel;

By the way, which paper describes the COVER algorithm? I found a reference somewhere in the source code a while ago but cannot find it anymore. I'd like to take a look to understand all these parameters.

terrelln commented 5 years ago

Setting the compression level is the difference. The optimize function selects the best parameters k and d (though d doesn't really matter, so we fix it to 8 for speed). It constructs a dictionary with each parameter set and tests the compressed size with the given compression level.

If you use the same level you use for compression, it will tune the dictionary better for your use case. Also, if you are using level 1, training the dictionary will be faster, since it is faster to test the compressed size.

The steps parameter tells the algorithm how many parameter sets to try, in this case we are trying 4 sets. The dictionary quality may improve if you allow more steps (like 40), but it will be proportionally slower. We've chosen a relatively low value for the default.

The COVER and FASTCOVER algorithms count the frequency of each segment of length d, called a dmer, in the training corpus. Then they select segments of length k, called kmers, to insert into the dictionary. A kmer is scored by summing the frequencies of each of the k - d + 1 dmers it contains. The algorithm picks the segment with the highest score, sets the frequency of each dmer in the selected kmer to zero (since the dictionary already covers it), and repeats until the dictionary is full.
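For intuition, the counting-and-scoring loop described above can be sketched in plain C. This is a toy illustration, not the zstd implementation: the real FASTCOVER uses d = 8, a much larger frequency table, and splits the corpus into epochs; the names and sizes here are my own.

```c
#include <stdint.h>
#include <string.h>

#define D 4             /* toy dmer length; zstd's default is d = 8 */
#define TABLE_SIZE 4096 /* frequency table, indexed by a truncated hash */

/* Hash a d-byte window into the table. FASTCOVER likewise counts hashed
 * dmers rather than exact ones, trading accuracy for speed. */
static uint32_t dmer_hash(const uint8_t *p)
{
    uint32_t h = 2166136261u;                   /* FNV-1a */
    for (int i = 0; i < D; i++) { h ^= p[i]; h *= 16777619u; }
    return h & (TABLE_SIZE - 1);
}

/* Pass 1: count how often each dmer occurs in the training corpus. */
static void count_dmers(const uint8_t *corpus, size_t len, uint32_t *freq)
{
    memset(freq, 0, TABLE_SIZE * sizeof *freq);
    for (size_t i = 0; i + D <= len; i++)
        freq[dmer_hash(corpus + i)]++;
}

/* Score a candidate segment of length k: the sum of the frequencies
 * of its k - d + 1 dmers. */
static uint64_t score_kmer(const uint8_t *seg, size_t k, const uint32_t *freq)
{
    uint64_t score = 0;
    for (size_t i = 0; i + D <= k; i++)
        score += freq[dmer_hash(seg + i)];
    return score;
}

/* After the best-scoring segment is appended to the dictionary, zero
 * its dmers so already-covered content stops attracting new segments. */
static void zero_kmer(const uint8_t *seg, size_t k, uint32_t *freq)
{
    for (size_t i = 0; i + D <= k; i++)
        freq[dmer_hash(seg + i)] = 0;
}
```

Selection then repeats (score candidates, take the best, zero its dmers) until the dictionary buffer is full.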

xinglin commented 5 years ago

@terrelln : Thanks for the explanation. When using different compression levels, we do see a difference in training time. However, how do we explain the huge performance difference between using ZDICT_trainFromBuffer() and using ZDICT_optimizeTrainFromBuffer_fastCover() directly?

Please see my experiments below.

Case (1): use compression level in training (cLevel == 1 in my experiments)

    // In my own code
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    params.zParams.compressionLevel = cLevel;

Result:

        compressed size=5534618471, compress ratio=6.21 X
        train time=32.51 secs, compress time=114.67 secs, compress throughput=222.63 MB/s
        Average train time=15.875990 msecs, average compress time=0.027340 msecs
        Number of trains=2048, Number of blocks=4194304

Case (2): always set to level 3

    // In my own code
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    params.zParams.compressionLevel = 3;

Result:

        compressed size=5595419632, compress ratio=6.14 X
        train time=35.15 secs, compress time=112.15 secs, compress throughput=222.46 MB/s
        Average train time=17.161250 msecs, average compress time=0.026739 msecs
        Number of trains=2048, Number of blocks=4194304

Case (3): change compression level to 1 in ZDICT_trainFromBuffer()

// in dictBuilder library
size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                             const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples)
{
    ZDICT_fastCover_params_t params;
    DEBUGLOG(3, "ZDICT_trainFromBuffer");
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    /* Default to level 6 since no compression level information is available */
    params.zParams.compressionLevel = 1;     // here, change from 3 to 1

Result:

        compressed size=6584890542, compress ratio=5.22 X
        train time=91.36 secs, compress time=127.28 secs, compress throughput=149.87 MB/s
        Average train time=44.608218 msecs , average compress time=0.030346 msecs
        Number of trains=2048, Number of blocks=4194304

Case (4): default library version (use level 3 in ZDICT_trainFromBuffer())

// in dictBuilder library
size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                             const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples)
{
    ZDICT_fastCover_params_t params;
    DEBUGLOG(3, "ZDICT_trainFromBuffer");
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    /* Default to level 6 since no compression level information is available */
    params.zParams.compressionLevel = 3; 

Result:

        compressed size=6627597125, compress ratio=5.18 X
        train time=102.40 secs, compress time=127.18 secs, compress throughput=142.73 MB/s
        Average train time=50.001226 msecs, average compress time=0.030323 msecs
        Number of trains=2048, Number of blocks=4194304
terrelln commented 5 years ago

If you're calling ZDICT_optimizeTrainFromBuffer_fastCover() exactly the same way that ZDICT_trainFromBuffer() is, then I have no explanation for you.

Maybe you are using an older version of ZDICT_trainFromBuffer(), which used to call a different, slower function? It was changed in zstd-1.3.6 from ZDICT_optimizeTrainFromBuffer_cover() to ZDICT_optimizeTrainFromBuffer_fastCover().

xinglin commented 5 years ago

I am using 1.3.8, and it should be the fastCover implementation that is called from ZDICT_trainFromBuffer(). I am very confused by this as well, hence the question.

// from lib/zstd.h
#define ZSTD_VERSION_MAJOR    1
#define ZSTD_VERSION_MINOR    3
#define ZSTD_VERSION_RELEASE  8
// from lib/dictBuilder/zdict.c
size_t ZDICT_trainFromBuffer(void* dictBuffer, size_t dictBufferCapacity,
                             const void* samplesBuffer, const size_t* samplesSizes, unsigned nbSamples)
{
    ZDICT_fastCover_params_t params;
    DEBUGLOG(3, "ZDICT_trainFromBuffer");
    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    /* Default to level 6 since no compression level information is available */
    params.zParams.compressionLevel = 3;
#if defined(DEBUGLEVEL) && (DEBUGLEVEL>=1)
    params.zParams.notificationLevel = DEBUGLEVEL;
#endif
    return ZDICT_optimizeTrainFromBuffer_fastCover(dictBuffer, dictBufferCapacity,
                                               samplesBuffer, samplesSizes, nbSamples,
                                               &params);
}
terrelln commented 5 years ago

And you're passing exactly the same set of files, in exactly the same order, and getting different results?

xinglin commented 5 years ago

I think so. I am compressing a single file.

terrelln commented 5 years ago

The function is deterministic given the same parameters, the same dictBufferCapacity, and the same samples in the same order (samplesBuffer, samplesSizes, and nbSamples).

So there is either a bug in the function, or more likely a bug in your use case.

If you can provide the data you are compressing and a snippet of the code, I can rerun your test and see if I can reproduce what's going on. Otherwise, I will look into reproducing it next week, but I don't think I'll be able to.

xinglin commented 5 years ago

I cannot share the dataset, but I found that the same performance anomaly happens for other datasets as well. Try: https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.7.tar.xz

Uncompress it first.

block size = 8K, nbSamples = 200, buffersize = 16M, maxDictSize = 128K.

/*
 * Compress a file using a dictionary. A dictionary is trained from the
 * first 200 8K blocks of every read buffer.
 */
static void DictCompress(const char* fname)
{ 
    size_t fsize = fsize_orDie(fname), readSize = -1, dictSize = -1;
    FILE* inFile = fopen_orDie(fname, "rb");
    void* inbuffer = malloc_orDie(buffersize);
    void* dictBuff = malloc_orDie(maxDictSize);
    size_t sampleSizes[nbSamples];
    size_t dictSizeSum=0, cSizeSum=0;
    size_t const cBuffSize = ZSTD_compressBound(blocksize);
    void* cBuffer = malloc_orDie(cBuffSize);
    size_t cSize = -1;
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CDict* cdict;
    unsigned bufferid = 0;
    clock_t begin, end;
    double zstdDictTrainTime = 0.0, zstdDictCompressTime = 0.0;
    unsigned long trainTotal=0, blockTotal=0;
    ZDICT_fastCover_params_t params;

    if (cctx == NULL) {
        fprintf(stderr, "ZSTD_createCCtx() error\n");
        exit(10);
    }

    printf("zstd with a dictionary: inFile=%s, size=%zu\n", fname, fsize);

    memset(&params, 0, sizeof(params));
    params.d = 8;
    params.steps = 4;
    params.zParams.compressionLevel = cLevel;

    for(int i=0; i < nbSamples; i++) 
        sampleSizes[i] = 8192;

    while ( (readSize = fread_orDie(inbuffer, buffersize, inFile)) > 0) {
        // printf("buffer %u, size=%zd\n", bufferid, readSize);

        int blocknum = buffersize/blocksize;

        begin = clock();
        // compress with a dictionary
        // Use one or the other, to train a dictionary.
        //dictSize = ZDICT_optimizeTrainFromBuffer_fastCover(dictBuff, maxDictSize, inbuffer, sampleSizes, nbSamples, &params);
        dictSize = ZDICT_trainFromBuffer(dictBuff, maxDictSize, inbuffer, sampleSizes, nbSamples);
        if (ZDICT_isError(dictSize)) {
            fprintf(stderr, "dictionary training failed: %s\n", ZDICT_getErrorName(dictSize));
            exit(-1);
        }
        //printf("dictSize=%lld\n", dictSize);
        cdict = ZSTD_createCDict(dictBuff, dictSize, cLevel);
        if (!cdict) {
            fprintf(stderr, "ZSTD_createCDict error\n");
            exit(7);
        }
        end = clock();
        zstdDictTrainTime += (double)(end-begin)/CLOCKS_PER_SEC;
        dictSizeSum += dictSize;
        trainTotal++;

        begin = clock();
        for(int i=0; i < blocknum; i++) {
            /* cast to char*: pointer arithmetic on void* is a GCC extension */
            cSize = ZSTD_compress_usingCDict(cctx, cBuffer, cBuffSize, (char*)inbuffer + i*blocksize, blocksize, cdict);
            if (ZSTD_isError(cSize)) {
                fprintf(stderr, "error compressing with dictionary %s : %s \n", fname, ZSTD_getErrorName(cSize));
                exit(8);
            }

            cSizeSum += cSize;
            blockTotal++;
        }

        end = clock();
        zstdDictCompressTime += (double)(end-begin)/CLOCKS_PER_SEC;

        ZSTD_freeCDict(cdict);  /* free the per-buffer dictionary; otherwise one CDict leaks per iteration */
        bufferid++;
    }

Results: with ZDICT_optimizeTrainFromBuffer_fastCover().

xing@atg-s-holder:~/w/taco/srctest$ ./compress -s 200 -c 2 /mnt/swap/linux/linux-5.0.7.tar 
            *** zstd ***
blocksize=8192, buffersize=16777216
samples=200
zstd with a dictionary: inFile=/mnt/swap/linux/linux-5.0.7.tar, size=863416320
[zstd compression with sampling-based dictionary]: 
  file=/mnt/swap/linux/linux-5.0.7.tar, inFileSize=863416320
        compressed size=231987086, dictionary size=2263974, compress ratio=3.72 X, (3.69 X, with dict)
        train time=1.21 secs, compress time=4.34 secs, compress throughput=148.36 MB/s
        Average train time=23.286404 msecs, average compress time=0.040745 msecs
        Number of trains=52, Number of blocks=106496

with trainFromBuffer()

xing@atg-s-holder:~/w/taco/srctest$ ./compress -s 200 -c 2 /mnt/swap/linux/linux-5.0.7.tar 
            *** zstd ***
blocksize=8192, buffersize=16777216
samples=200
zstd with a dictionary: inFile=/mnt/swap/linux/linux-5.0.7.tar, size=863416320
[zstd compression with sampling-based dictionary]: 
  file=/mnt/swap/linux/linux-5.0.7.tar, inFileSize=863416320
        compressed size=233661132, dictionary size=2263942, compress ratio=3.70 X, (3.66 X, with dict)
        train time=6.02 secs, compress time=4.44 secs, compress throughput=78.77 MB/s
        Average train time=115.704981 msecs, average compress time=0.041657 msecs
        Number of trains=52, Number of blocks=106496
terrelln commented 5 years ago

Thanks for the context! I'll try to reproduce it, and get back to you.

terrelln commented 5 years ago

ZDICT_optimizeTrainFromBuffer_fastCover() will update the parameters you pass in to indicate which parameters it selected during its optimization process. So after the first call, you are running with fixed parameters, meaning that it will be faster.

Since the optimization process doesn't seem to gain much after the first run, you could run ZDICT_optimizeTrainFromBuffer_fastCover() once with many steps (maybe 40), then use those fixed parameters for the rest of the file, or at least for a few runs, since that is much faster.

Sorry for the confusing API; if I were to rewrite it, I would have it return the new parameters in a returned struct. I don't expect too many users of the optimizing training functions, so I will consider changing it in zstd-1.4.1.

xinglin commented 5 years ago

Thanks @terrelln for answering my question! I think that is the real cause of this performance anomaly. I had not noticed that the first call to ZDICT_optimizeTrainFromBuffer_fastCover() takes significantly longer than the rest.

steps=4

zstd with a dictionary: inFile=/mnt/swap/linux/linux-5.0.7.tar, size=863416320
train          0: 1240.547000 msecs
train          1: 273.390000 msecs
train          2: 214.618000 msecs
train          3: 221.273000 msecs
...

steps=40

zstd with a dictionary: inFile=/mnt/swap/linux/linux-5.0.7.tar, size=863416320
train          0: 8065.819000 msecs
train          1: 252.403000 msecs
train          2: 208.043000 msecs
train          3: 216.096000 msecs
...

On a related note, how do we decide when to train a new dictionary? What techniques do people use to decide that the incoming data is different enough from the previous samples that we need to train a new dictionary?

terrelln commented 5 years ago

We have a blog post on dictionary compression https://code.fb.com/core-data/zstandard/.

Most of our dictionary use cases group their data into "categories" and each category gets its own dictionary. We retrain dictionaries if we detect that we can improve compression ratio.

Generally, dictionaries are trained offline and then used online. But if you are training dictionaries online, which is now reasonably fast using ZDICT_trainFromBuffer_fastCover(), you might try retraining a dictionary every so often and switching to the new one if the compression ratio improves by a large enough margin. Make sure to test the dictionary on data other than what you trained it on, though.

xinglin commented 5 years ago

Thanks for sharing your approach.

4aalon commented 3 years ago

Can someone please explain what a segment, an epoch, a dmer, and a kmer are, and what the relations between them are?