g1mv / density


Problems with the density_context by multiple uses #72

Closed 191919 closed 7 months ago

191919 commented 6 years ago

I ran a profiling test which showed that calloc is the top CPU consumer. I saw that in density_allocate_context there is a line:

memset(context->dictionary, 0, context->dictionary_size);

(It turned out that gcc optimizes the combination of x = malloc(y) and memset(x, 0, y) into a single calloc call.)
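
For illustration, here is a minimal reproduction of that pattern with a hypothetical helper (not density's actual allocation code); with optimization enabled, gcc folds the malloc/memset pair into one calloc call:

#include <stdlib.h>
#include <string.h>

/* With optimization enabled, gcc recognizes this malloc + memset pair and
   emits a single calloc call, which is why calloc tops the profile. */
void *allocate_zeroed_dictionary(size_t dictionary_size) {
    void *dictionary = malloc(dictionary_size);
    if (dictionary != NULL)
        memset(dictionary, 0, dictionary_size);
    return dictionary;
}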

So the problem is that every time I use the compress or decompress routine, density must clear (1 << 16) × (2 + 1) × 4 = 786432 bytes. That is quite a burden on any machine.

I see there is a density_compress_with_context function which seems to allow a pre-allocated context to be used multiple times, but since the dictionary inside the context has been changed, the output (compressed buffer) cannot be decompressed correctly.

g1mv commented 6 years ago

Hello ! For compression and decompression to work properly, the dictionary used must be in the same state when starting a compress/decompress session. On some platforms, if you do not zero the memory it can contain random data after a malloc, and this leads to the problem described above, i.e. dictionary states not being identical before compress and decompress. In previous versions of density I had a preset zero dictionary set up once and for all on the stack and I copied its contents every time the dictionary was initialized, but I seriously doubt this can be faster than an optimized memset function.

As you rightly point out though, you can use your own custom context. So if you want to compress a lot of small datasets (lots of potential dictionary zeroing, which causes slowdowns), what you could do is concatenate every 1000 datasets into one, for example, and then compress the resulting concatenated dataset (hence only one dictionary init for every 1000 datasets, so 1000 times less zeroing, as you would use the same context over and over). One of the advantages of this method, apart from speed, is that when the datasets are very similar in nature, the context dictionary is already matching the data and the compression ratio improves as the algorithm's "learning curve" shortens. The drawback of this method is that you can't decompress a single dataset, you have to decompress a block of 1000 to access a single one, but with an intelligent cache system you can also greatly improve performance here.

191919 commented 6 years ago

I am trying to add density to the InnoDB page compression algorithm list. Concatenation of data blocks is impossible there. In fact, a lot of scenarios don't allow data concatenation, e.g., real-time data or per-file transfers.

Would you please check the possibility of reducing the dictionary size?

g1mv commented 6 years ago

The main problem with a small dictionary size is that machine learning of dataset structures (unless they are very simple) will be less efficient, resulting in a decreased compression ratio. But I completely understand the problem you're facing, I'll have a look into it. Maybe there is a way to do memory zeroing much faster than memset or calloc which I'm unaware of ? Another option would be to calloc a lot of memory at once, say 128 MB for example (one big calloc should be faster than numerous small ones), and then use it fragment after fragment; depending on the algorithm you're choosing it would easily make up for hundreds of uses. Or, in the case of a multithreaded environment, you could have an idle thread preparing memory for future use as dictionaries (zeroing in the background)?
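
For illustration, a minimal sketch of the "one big calloc, used fragment after fragment" idea (all names and sizes here are assumptions, not density's API):

#include <stdlib.h>

#define DICTIONARY_SIZE ((size_t)786432)               /* bytes cleared per context today */
#define POOL_SIZE       ((size_t)128 * 1024 * 1024)    /* one big allocation, e.g. 128 MB */

typedef struct {
    unsigned char *base;
    size_t offset;
} dictionary_pool;

/* One large calloc up front: freshly mapped pages come back already zeroed,
   which is usually cheaper than many small allocate-and-memset cycles. */
int pool_init(dictionary_pool *pool) {
    pool->base = calloc(1, POOL_SIZE);
    pool->offset = 0;
    return pool->base != NULL;
}

/* Hand out the next pre-zeroed, dictionary-sized fragment, or NULL when the
   pool is exhausted (a real implementation would refill or recycle here). */
unsigned char *pool_next_dictionary(dictionary_pool *pool) {
    if (pool->offset + DICTIONARY_SIZE > POOL_SIZE)
        return NULL;
    unsigned char *fragment = pool->base + pool->offset;
    pool->offset += DICTIONARY_SIZE;
    return fragment;
}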

tarsa commented 6 years ago

Hi Guillaume, I think I have another solution :) Instead of shrinking the dictionary size you can use flags to denote which dictionary entries are valid and which are possibly dirty. A bit flag consumes 32x less space than a 32-bit value, so you would only need to clear 24 KiB of flags at the beginning. The drawback is of course the overhead of flag management. But there's also a solution - make the dirty (i.e. flag-based) mode adaptive and, after processing e.g. 20 KB of data, zero all unused dictionary entries and switch to the original mode that doesn't use flags.
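
As a rough, illustrative sketch of the flag idea (assumed names and layout, not density's actual dictionary structures), with 786432 bytes of 32-bit entries guarded by a 24 KiB bitmap:

#include <stdint.h>
#include <string.h>

#define ENTRY_COUNT (3 * 65536)            /* 786432 bytes of 32-bit entries        */

typedef struct {
    uint32_t entries[ENTRY_COUNT];         /* left dirty between sessions           */
    uint32_t valid[ENTRY_COUNT / 32];      /* 1 bit per entry: only 24 KiB to clear */
} flagged_dictionary;

/* Starting a session clears only the 24 KiB flag bitmap, not the 768 KiB of entries. */
void dictionary_reset(flagged_dictionary *dict) {
    memset(dict->valid, 0, sizeof(dict->valid));
}

/* First touch treats the entry as zero (it could instead be copied from a
   pre-built dictionary); afterwards the entry behaves exactly as today. */
uint32_t dictionary_read(flagged_dictionary *dict, uint32_t index) {
    uint32_t word = index >> 5, bit = (uint32_t)1 << (index & 31);
    if (!(dict->valid[word] & bit)) {
        dict->valid[word] |= bit;
        dict->entries[index] = 0;
    }
    return dict->entries[index];
}

void dictionary_write(flagged_dictionary *dict, uint32_t index, uint32_t value) {
    dict->valid[index >> 5] |= (uint32_t)1 << (index & 31);
    dict->entries[index] = value;
}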

g1mv commented 6 years ago

Hey Piotr ^^ Thanks, yes this is looking like a promising idea at first impression. I'll have to investigate and benchmark it but it would clearly greatly reduce the amount of mem zeroing on algo initialization. If one wants to compress/decompress a huge number of very small datasets though, this method will for sure improve the mem init speed but bitflag dictionary processing will always be active and alter the algorithms' speed, so in all cases there will be a performance hit for processing small datasets, which is not something desirable really.

So maybe another idea would be to have a variable-size dictionary, starting at a small size and then growing once we reach a certain threshold of processed data. That way the initial dictionary could be allocated and zeroed very quickly while still ensuring reasonable compression ratios for smallish datasets, and be augmented later on to its final size as the processed data quantity grows. The switch to the final, bigger dictionary could take place after a certain amount of data is processed (for example after having processed 32 kB), or after noticing the start of a degradation in compression ratio (this would be very useful for low-entropy datasets as the bigger dictionary would never be necessary, but in all other cases it incurs a temporary reduction in compression ratio before the new dictionary is set up).

191919 commented 6 years ago

I tried the approach of allocating a huge chunk of memory, using it as an array of 768 KB memory blocks, managing it with a bitmap, and clearing the dirty blocks with another thread.

In the case of InnoDB page compression under a heavily loaded server, I have to allocate over 4 GB of memory to keep up with the pace of memory demand, with the zeroing thread eating up 30% of a core.

Perhaps reducing the dictionary size is the final solution.

g1mv commented 6 years ago

@191919 , for testing purposes, would you have a histogram or something equivalent of the number of datasets against size for the data you're using ? i.e. for example 123 datasets < 1kB, 345 datasets >= 1kB and < 2kB, 234 datasets >= 2kB and < 3kB, etc. ? Thanks. That would help in testing Piotr @tarsa 's idea and the variable dictionary size one.

191919 commented 6 years ago

@gpnuma As mentioned in my previous post, I am adding density as a candidate algorithm for InnoDB page compression; the page size can be specified when you create the table.

g1mv commented 6 years ago

Ok thanks I see !

tarsa commented 6 years ago

Hey Piotr ^^ Thanks, yes this is looking like a promising idea at first impression. I'll have to investigate and benchmark it but it would clearly greatly reduce the amount of mem zeroing on algo initialization. If one wants to compress/decompress a huge number of very small datasets though, this method will for sure improve the mem init speed but bitflag dictionary processing will always be active and alter the algorithms' speed, so in all cases there will be a performance hit for processing small datasets, which is not something desirable really.

If all data streams are below the threshold then indeed density would always run with active bitflag management.

Dynamic buffer size sounds OK and should be faster than bitflag management before hitting the threshold, but it has disadvantages:

- decreased compression ratio due to smaller dictionary size at the beginning (but probably it won't be a big decrease if threshold is carefully chosen)
- possibly big lag when transferring data from small dictionary to big dictionary assuming you would be doing that - otherwise further compression ratio decrease would happen
- using prebuilt dictionaries would be not feasible or be less efficient

The idea with dictionaries and bitflag management is that you can copy entries from a pre-built dictionary on first use instead of zeroing them. Pre-built dictionaries are most effective on small independent data blocks that share a lot of dictionary entries. Investigation is needed to assess whether pre-built dictionaries can greatly help improve the compression ratio in the InnoDB case.

I'm not sure what the overhead of bitflag management would be, though. If it's very high then it wouldn't be an interesting alternative. But given the possible advantages it's worth checking.

What do you think?

g1mv commented 6 years ago

Yes I'm definitely going to check it, the only drawback I can see is that some algorithms like chameleon are so fast that even a bit mask and test might hit performance in a non-negligible way, but I'll let benchmarking be the judge.

decreased compression ratio due to smaller dictionary size at the beginning (but probably it won't be a big decrease if threshold is carefully chosen)

That's absolutely right, there will be more collisions after hashing. My idea is to initially consider groups of 2 bytes instead of 4 so dictionary entries are already twice as small, and to use a dictionary with far fewer entries, which could additionally make it fit in faster cache levels to compensate for the loss of speed incurred by doubling the hashing ops. 2-byte entries give a maximum of 65536 possibilities; I need to benchmark what's feasible but I'm pretty sure a 1024 or 2048 entry dictionary would give a reasonable compression ratio... to be tested. That would be 64 or 32 times fewer entries than the current dictionary for chameleon as an example, and would require 2 x 1024 = 2 kB or 2 x 2048 = 4 kB of memory to be zeroed against the current 4 x 65536 = 256 kB for chameleon. So 128 or 64 times faster.
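
As a back-of-the-envelope sketch of that variant (the hash and table layout below are assumptions, not chameleon's actual code): 2-byte groups hashed into a 1024-entry table of 2-byte values, i.e. only 2 kB to zero per context:

#include <stdint.h>

#define SMALL_DICT_BITS    10                    /* 1024 entries            */
#define SMALL_DICT_ENTRIES (1u << SMALL_DICT_BITS)

typedef struct {
    uint16_t entries[SMALL_DICT_ENTRIES];        /* 2 x 1024 = 2 kB to zero */
} small_dictionary;

/* Illustrative multiplicative hash of a 2-byte group down to SMALL_DICT_BITS bits. */
uint32_t hash_group(uint16_t group) {
    return ((uint32_t)group * 0x9D6Bu) >> (32 - SMALL_DICT_BITS);
}

/* Returns 1 when the 2-byte group is predicted by the dictionary (only a flag
   plus the hash would need to be emitted), 0 when the entry has to be updated
   and the group emitted verbatim. */
int predict_group(small_dictionary *dict, uint16_t group) {
    uint32_t h = hash_group(group);
    if (dict->entries[h] == group)
        return 1;
    dict->entries[h] = group;
    return 0;
}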

possibly big lag when transferring data from small dictionary to big dictionary assuming you would be doing that - otherwise further compression ratio decrease would happen

That's definitely a problem. One solution I see for this is - if not using a prebuilt dictionary - performing an ascending 256-byte zeroing and a big dictionary insert (if the hashed value is below the zeroing counter, obviously) every, let's say, 256 bytes read during the initial "small dictionary phase", so in the beginning the big dictionary gets populated progressively without any impact on speed. By carefully choosing the frequency of big dictionary inserts, when we switch to the "big dictionary phase" it is already pre-populated with coherent data.
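
A sketch of how that interleaving might look (structure and names are assumptions, not the actual implementation): every 256 bytes of input processed during the small-dictionary phase, the next 256 bytes of the big dictionary are zeroed, and an insert happens only when the target slot already lies below the zeroed watermark:

#include <stdint.h>
#include <string.h>

#define BIG_DICT_BYTES 262144u          /* e.g. chameleon's 4 x 65536 byte table */
#define ZERO_STEP      256u             /* zeroed per 256 bytes of input read    */

typedef struct {
    uint8_t big[BIG_DICT_BYTES];
    uint32_t zeroed;                    /* ascending watermark, in bytes         */
} growing_dictionary;

/* Called once per 256-byte chunk processed during the "small dictionary phase":
   advance the zeroing watermark by one step, spreading the memset cost. */
void big_dictionary_step(growing_dictionary *dict) {
    if (dict->zeroed < BIG_DICT_BYTES) {
        memset(dict->big + dict->zeroed, 0, ZERO_STEP);
        dict->zeroed += ZERO_STEP;
    }
}

/* Opportunistically insert a value when its slot is already below the watermark,
   so the big dictionary is pre-populated with coherent data by the time we
   switch to the "big dictionary phase". */
void big_dictionary_maybe_insert(growing_dictionary *dict, uint32_t byte_offset,
                                 uint32_t value) {
    if (byte_offset + sizeof(value) <= dict->zeroed)
        memcpy(dict->big + byte_offset, &value, sizeof(value));
}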

using prebuilt dictionaries would be not feasible or be less efficient

I don't really think, unless I missed your point, that this would be an issue. Instead of pre-building one big dictionary you could just as well pre-build a small one and a big one.

I'd be glad to know your opinion on this before I give all of the ideas a trial run; it's very possible I could have overlooked something ^^

tarsa commented 6 years ago

I don't really think, unless I missed your point, that this would be an issue. Instead of pre-building one big dictionary you could just as well pre-build a small one and a big one.

But how to transfer the modifications (done while compressing data before exceeding the threshold) from the small dictionary to the big one? Perhaps for chameleon it would be quick, but for mandala and lion you would need to reconcile the queues of chunks in dictionary entries. That would probably be slower than redoing the compression from scratch using the big dictionary.

One solution I see for this is - if not using a prebuilt dictionary - performing an ascending 256-byte zeroing and a big dictionary insert (if the hashed value is below the zeroing counter, obviously) every, let's say, 256 bytes read during the initial "small dictionary phase", so in the beginning the big dictionary gets populated progressively without any impact on speed.

Neat solution. Could be the winner, although you will have an extra branch here, so benchmarks will tell the truth.

Lots of experimentation ahead :) On my side, I'm busy with a Rust rewrite (and improvement) of demixer ( https://encode.ru/threads/1671-Demixer-new-tree-based-bitwise-CM-codec-is-in-development ). Currently I'm working on tree handling and it's very tricky. Nobody said developing compression algorithms would be easy, though.

g1mv commented 6 years ago

Lots of experimentation ahead :) On my side, I'm busy with a Rust rewrite (and improvement) of demixer ( https://encode.ru/threads/1671-Demixer-new-tree-based-bitwise-CM-codec-is-in-development ). Currently I'm working on tree handling and it's very tricky. Nobody said developing compression algorithms would be easy, though.

Sounds like an interesting project, on the other end of the ratio/speed spectrum !

g1mv commented 6 years ago

A quick heads up, as I've been working on this quite a bit these days. I think I have found an acceptable solution to the problem which also has the beneficial side effect of improving the compression ratio for small datasets. It involves some of the ideas mentioned here, including using low-memory-footprint start dictionaries and bitmasking. The main point is that performance still seems to be very good, although it's too early to know if everything will fit in nicely with the current algorithms' logic. As soon as I have working code for the chameleon algorithm to start with, I'll publish it in the dev branch and mention it here so you can give it a try.

g1mv commented 6 years ago

A quick status update on this: I had made good progress in this branch: https://github.com/gpnuma/density/tree/adaptative-engine. But a problem showed up a few days ago: during testing, the gcc-compiled code ran 4x slower than clang's. I know the main compiler for this project is clang, but I think it would be unreasonable to publish code that cannot be correctly optimized by gcc as it is quite a ubiquitous compiler. So I started a question on SO here: https://stackoverflow.com/questions/49098453/gcc-compiled-code-performance-catastrophic-compared-to-clangs to try to find options/solutions with the community for what seems to be a broader problem. In the meantime I've found a workaround for the specific density optimization which led to the aforementioned problem; I'll implement it and push it in the forthcoming days.

191919 commented 6 years ago

I noticed the gcc/clang speed difference and added Intel C++ Compiler 2018 to the list, then I checked the executables with IDA.

The three compilers took completely different approaches to expanding __builtin_memcpy.

icc ignores __builtin_memcpy: it calls __intel_fast_memcpy even when cb is known. This makes its executable the slowest.

gcc-7.3 expands __builtin_memcpy into a reducing sequence of 4/2/1-byte copy operations. This is inefficient on the x86/x64 architecture when cb is no larger than the widest register (i.e., 64 bits on x64).

clang's __builtin_memcpy is what people expect from the meaning of __builtin_*: a combination of 4/2/1-byte copy operations. And of course it gives the best speed.

P.S. gcc supports -mstringop-strategy for memcpy, but none of the strategies (no_stringop, libcall, rep_byte, rep_4byte, rep_8byte, byte_loop, loop, unrolled_loop, vector_loop) has the same semantics as clang's.

g1mv commented 6 years ago

Yes, I was honestly very surprised that these __builtin_memcpy calls are not properly optimized in gcc, and it's even worse in icc apparently. I had looked for the slowdown problem everywhere in the code but there ! I would really have thought they would have given much more attention to this, especially since constructs such as uint64_t value = *(uint64_t*)pointer; are supposed to be unsafe in comparison to uint64_t value; memcpy(&value, pointer, 8);

191919 commented 6 years ago

@gpnuma Why is uint64_t value = *(uint64_t*)pointer; unsafe in comparison to memcpy(&value, pointer, 8); ?

g1mv commented 6 years ago

The first thing that looks worrying is: what happens if pointer is not aligned? But in the case of the above construct, maybe the compiler is clever and will replace it with a memcpy. I'd say the main problem though is that this kind of pointer cast can lead to undefined behavior because of the strict aliasing rule (this was quite informative for me: https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule), which basically postulates that pointers of different types can't point to the same location in memory (unless you use unions), and this assumption is used by compilers to optimize code. Hence compiler optimizations could end up breaking your code, so I use this with great caution.
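
For reference, the two constructs being compared, as a minimal self-contained sketch (function names are illustrative); the memcpy form is the one gcc and clang are expected to compile down to a single 8-byte load:

#include <stdint.h>
#include <string.h>

/* Potentially problematic: the uint64_t* access can violate strict aliasing if
   the bytes were written through a pointer of an incompatible type, and the
   load may also be misaligned on some platforms. */
uint64_t read64_cast(const void *pointer) {
    return *(const uint64_t *)pointer;
}

/* Well-defined: memcpy copies the bytes without an aliasing-violating access,
   and for a fixed 8-byte size gcc and clang compile it to a single load. */
uint64_t read64_memcpy(const void *pointer) {
    uint64_t value;
    memcpy(&value, pointer, sizeof(value));
    return value;
}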

191919 commented 6 years ago

If there are millions of 8-byte copies to do, alignment matters, but I don't think there will be a performance impact if there are only 8 bytes.

In fact, both clang and gcc optimize memcpy(a, b, 8); to something like *(uint64_t*)a = *(uint64_t*)b; whenever possible.

The following test code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>

const size_t kTotal = 1234567;

int main() {
    uint8_t* pd = malloc(kTotal);
    uint8_t* ps = malloc(kTotal);

    memset(ps, 2, kTotal);

    for (size_t i = 0; i < kTotal - 8; ++i) {
        uint64_t v;
        memcpy(&v, ps, 8);
        pd[i] = (v * i) >> 56;
        ps++;
    }

    uint64_t r = 1;
    for (size_t i = 0; i < kTotal - 8; ++i) {
        r += pd[i];
    }

    printf("%" PRIu64 " %d\n", r, pd[kTotal / 2]);
}

These 2 lines

        memcpy(&v, ps, 8);
        pd[i] = (v * i) >> 56;

will be generated as:

gcc:

__text_startup:0000000100000B10 mov rdx, [rcx+rax]
__text_startup:0000000100000B14 imul rdx, rax
__text_startup:0000000100000B18 shr rdx, 38h
__text_startup:0000000100000B1C mov [rbx+rax], dl

clang:

__text:0000000100000E36 mov rcx, [r14+rax]
__text:0000000100000E3A imul rcx, rax
__text:0000000100000E3E shr rcx, 38h
__text:0000000100000E42 mov [rbx+rax], cl

g1mv commented 6 years ago

If there are millions of 8-byte copies to do, alignment matters, but I don't think there will be a performance impact if there are only 8 bytes.

Well, in a fast compression library's typical usage there will probably be millions of 8-byte copies. Hopefully ^^

In fact, both clang and gcc optimize memcpy(a, b, 8); to something like *(uint64_t*)a = *(uint64_t*)b; whenever possible.

I don't mean that the compiler does not end up doing the 8-byte memory copy the way you say. What I mean is that using a pointer cast to a different type inside your C code is considered hazardous, because the strict aliasing postulate means the compiler will consider two pointers of different types to point to two different locations in memory for its optimizations (unless you shoot yourself in the foot optimization-wise and use -fno-strict-aliasing). If they both point to the same location then we are in undefined behavior territory, I reckon. Of course, when we use uint64_t value = *(uint64_t*)pointer; the pointer cast to a different type is only transient, so this probably does not apply, but if we were to use the cast (uint64_t*)pointer somewhere else we would have to be very careful. That's how I see things anyway 🤔

g1mv commented 6 years ago

I've just promoted a build to the dev branch improving the chameleon algorithm in the following ways :

This is bleeding-edge code so there are almost certainly optimizations still to be done; nevertheless @191919 I'd be happy to know how it works out in your particular test case before I propagate the improvements to the two other algorithms (only the chameleon algorithm is affected by the current changes). Thank you !

g1mv commented 6 years ago

@191919 I've moved our discussions to new, separate issues as this one is getting too cluttered.