@191919 how was 1.buf compressed exactly? Was it with the same density version (the one in the dev branch)? Do you have the original (i.e. uncompressed) data so I can try a roundtrip?
Please uncompress the attached gz and check the file density_01263.bin with the following test:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "density/density_api.h"

/* Compresses the first 32768 bytes of density_01263.bin into 1.buf
   and returns the compressed size. */
int cc() {
    uint8_t *dbuf = malloc(100000);
    uint8_t *cbuf = malloc(100000);
    FILE *fp = fopen("density_01263.bin", "rb");
    fread(dbuf, 1, 32768, fp);
    fclose(fp);
    density_processing_result r;
    r = density_compress(dbuf, 32768, cbuf, density_compress_safe_size(32768),
                         DENSITY_ALGORITHM_CHAMELEON);
    printf("compressed size=%d\n", (int) r.bytesWritten);
    fp = fopen("1.buf", "wb");
    fwrite(cbuf, 1, r.bytesWritten, fp);
    fclose(fp);
    free(dbuf);
    free(cbuf);
    return r.bytesWritten;
}

/* Reads the compressed data back from 1.buf and decompresses it. */
void dd(int clen) {
    uint8_t *dbuf = malloc(100000);
    uint8_t *cbuf = malloc(100000);
    FILE *fp = fopen("1.buf", "rb");
    fread(cbuf, 1, clen, fp);
    fclose(fp);
    density_decompress(cbuf, clen, dbuf, density_decompress_safe_size(65536));
    free(dbuf);
    free(cbuf);
}

int main() {
    dd(cc());
    return 0;
}
The output looked like this on my machine (not the one on which I wrote the previous post, but the reason for the crash is the same):
$ cc -g -o 1 1.c density/*.c
$ lldb ./1
(lldb) target create "./1"
Current executable set to './1' (x86_64).
(lldb) r
Process 42001 launched: './1' (x86_64)
compressed size=22520
32768 22520 -8
Process 42001 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1007fffe0)
frame #0: 0x00007fff7383e193 libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 627
libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell:
-> 0x7fff7383e193 <+627>: vmovaps ymm2, ymmword ptr [rsi - 0x40]
0x7fff7383e198 <+632>: sub rsi, 0x40
0x7fff7383e19c <+636>: sub rdx, 0x40
0x7fff7383e1a0 <+640>: ja 0x7fff7383e180 ; <+608>
Target 0: (1) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1007fffe0)
* frame #0: 0x00007fff7383e193 libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 627
frame #1: 0x0000000100014041 1`density_chameleon_decode(state=0x00007ffeefbff7c8, in=0x00007ffeefbff7e8, in_size=22512, out=0x00007ffeefbff7e0, out_size=65792) at chameleon_decode.c:209
frame #2: 0x00000001000012f0 1`density_decompress_with_context(input_buffer="", input_size=22512, output_buffer="", output_size=65792, context=0x0000000100400000) at buffer.c:180
frame #3: 0x00000001000015df 1`density_decompress(input_buffer="", input_size=22520, output_buffer="", output_size=65792) at buffer.c:207
frame #4: 0x000000010000082b 1`dd(clen=22520) at 1.c:34
frame #5: 0x000000010000086b 1`main at 1.c:40
frame #6: 0x00007fff735b9115 libdyld.dylib`start + 1
frame #7: 0x00007fff735b9115 libdyld.dylib`start + 1
Great! I'm able to reproduce now. Thank you!
@191919 it now works here with the latest dev branch:
build/benchmark -1 density_01263.bin
Single threaded in-memory benchmark powered by Density 0.15.0
Copyright (C) 2015 Guillaume Voirin
Built for MacOS (Little endian system, 64 bits) using Clang 9.1.0, Apr 3 2018 17:25:54
Allocated 80,448 bytes of in-memory work space
Chameleon algorithm
===================
Using file density_01263.bin copied in memory
Uncompressed and round-trip data hashes match. Starting main benchmark.
Round-tripping 32,768 bytes to 15,562 bytes (compression ratio 47.49% or 2.106x) and back
Compress speed 1485 MB/s (min 101 MB/s, max 2341 MB/s, best 0.0000s) <=> Decompress speed 1849 MB/s (min 135 MB/s, max 2731 MB/s, best 0.0000s)
Run time 10.000s (251319 iterations)
Released allocated memory.
@gpnuma Yes, it doesn't crash.
The top CPU consumer is still calloc, which eats up most of the time.
I have put the modified version on a proof-of-consistency cluster which receives data blobs, compresses them to local storage files on one server, and verifies by uncompressing and reading from another server. The test failed several times (it never has when using LZ4, even after over 550TB).
I will try to add more code to find out whether there is a pattern in the failing blocks.
@191919 thanks, yes I can understand that calloc is still using a lot of CPU time: when reaching the big dictionary stage, clearing is still needed, but it is now progressive and fully integrated in the algorithm.
But I have an idea to speed this up: I might allow the use of a read-only dictionary (i.e. one that does not learn during compression or decompression), since the datasets can be quite small and probably have a lot of similarities. It would allow for maximum speed, as you would instantiate the dictionary only once and then use it with all processes at the same time, for compressing and decompressing.
To check if this could be feasible and efficient, your datasets would have to be consistent (i.e. there should not be text mixed with binary data in the same dataset, for example). Do you think that is the case?
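To illustrate the read-only dictionary idea, here is a hypothetical sketch from the caller's side. The *_with_context calls mirror what is visible in the backtrace above, but the prepare-context signature and the "frozen dictionary" semantics are assumptions about a feature that does not exist yet:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include "density/density_api.h"

/* Hypothetical: a single context holding a read-only dictionary,
   instantiated once and then shared by every worker. */
static density_context *shared_context;

void init_shared_dictionary(void) {
    /* Assumed signature, modelled on the dev branch's context API. */
    density_processing_result prep = density_compress_prepare_context(
            DENSITY_ALGORITHM_CHAMELEON, false, malloc);
    shared_context = prep.context;  /* initialize once, then freeze */
}

uint_fast64_t compress_block(const uint8_t *in, uint_fast64_t in_size,
                             uint8_t *out, uint_fast64_t out_size) {
    /* With a dictionary that never learns, nothing is written back to
       shared_context, so concurrent use by many processes would be safe. */
    density_processing_result r = density_compress_with_context(
            in, in_size, out, out_size, shared_context);
    return r.bytesWritten;
}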
BTW I have just pushed a stability update; hopefully the test inconsistencies are now sorted.
@gpnuma My dataset is mixed with text and binary data.
I will update to your latest source to run the test again.
> My dataset is mixed with text and binary data.
In that case a read-only dictionary probably won't work efficiently (although it could be worth a test, if you have a big enough sample dataset); data needs to have at least some consistency for compression ratios to be interesting.
> The top CPU consumer is still calloc which eats up most of the time.
Now that I think of it, this is really strange, because with the latest chameleon algorithm implementation the only memset used is here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L102
And this zeroes only 32 bytes! The slowdown can't come from there. Even the next line,
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L103
clears only 256*8 = 2048 bytes, which cannot possibly be the bottleneck I think, even if the compiler interprets it as a calloc.
What happens now is that the actual total dictionary size is malloced initially, but only the necessary parts (very small) are zeroed, so as I said this calloc bottleneck is very weird to me on second thought; even further zeroing - if deemed necessary by the algorithm's decision engine - shouldn't have any significant CPU impact, and takes place here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/algorithms/chameleon/core/chameleon_encode.h#L206
for the bitmasks (8192 bytes only in the worst case), and here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/algorithms/chameleon/core/chameleon_encode.h#L61
which anyway cannot be interpreted by the compiler as a calloc, as it involves a multiplication with a bitmask value.
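As an aside on how a calloc can show up in a profile without being named in the source: gcc at -O2 typically folds a malloc immediately followed by a full-size memset into a single calloc call. A minimal illustration (not Density source):

#include <stdlib.h>
#include <string.h>

/* Built with gcc -O2, the malloc + memset pair below is usually
   folded into one calloc(1, size) call, so calloc appears in the
   profile even though this source never mentions it. */
void *allocate_zeroed(size_t size) {
    void *buffer = malloc(size);
    if (buffer != NULL)
        memset(buffer, 0, size);
    return buffer;
}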
@gpnuma I reran the Instruments utility.
In density_compress:
+0x1c movq %r8, 8(%rsp)
+0x21 movq %rsi, 16(%rsp)
+0x26 movq %rdx, 24(%rsp)
+0x2b callq "DYLD-STUB$$malloc"
+0x30 movl %ebx, %edi
+0x32 movl %ebx, (%rax)
+0x34 movq %rax, %r13
+0x37 callq "density_get_dictionary_size"
+0x3c movl $1, %esi
+0x41 movb $0, 4(%r13)
+0x46 movq %rax, %rdi
+0x49 movq %rax, 8(%r13)
+0x4d callq "DYLD-STUB$$calloc"
+0x52 movl (%r13), %esi
+0x56 movq %rax, 16(%r13)
+0x5a movq 8(%rsp), %r8
+0x5f cmpl $1, %esi
+0x62 je "density_compress+0xb0"
+0x64 xorl %ebp, %ebp
+0x66 xorl %ebx, %ebx
+0x68 cmpq $7, %r8
+0x6c movl $2, %r12d
+0x72 ja "density_compress+0x118"
You can see the calloc call at offset +0x4d. It is what ate 94% of the CPU time.
Some good news: after pulling your latest changes, my test cluster has processed over 30TB of data without any inconsistency.
@191919 wow, that's strange, because it seems to be in the initialization phase, where only 32 + 2048 bytes are zeroed. I can't imagine that taking 94% of CPU time; something else must be going on.
I honestly can't say more, as when I run instrumentation here the context allocation does not even take 1% of CPU time, even with smallish datasets (a few kB).
Just as an error check, are you 100% sure you're using DENSITY_ALGORITHM_CHAMELEON as the algorithm parameter? There's an if/else branch here that could lead to a much bigger, potentially performance-influencing calloc if that's not the case:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L100
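One quick way to rule this out is to make the test harness fail loudly on any error path. A sketch for the cc() function above (DENSITY_STATE_OK follows the public API's result state enum; the dev branch may differ slightly):

/* Defensive variant of the compression call in cc() above: keep the
   algorithm explicit and abort if the result reports an error. */
density_processing_result r = density_compress(
        dbuf, 32768, cbuf, density_compress_safe_size(32768),
        DENSITY_ALGORITHM_CHAMELEON);  /* chameleon has the smallest dictionary */
if (r.state != DENSITY_STATE_OK) {
    fprintf(stderr, "compression failed, state=%d\n", (int) r.state);
    exit(1);
}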
@gpnuma It seems to be another gcc problem.
If I use gcc-7.3.0 to compile my program, calloc is the top CPU eater, as you can see in the screenshot.
If I compile with clang, the CPU load is relieved.
I will test it on more Linux servers with gcc-8.
@191919 today's update brings a small API change: you are now able to know, via the context, what the initial size of the compressed data was, so it's much easier to set up an output buffer size.
It is accessible via the metadata struct here:
https://github.com/centaurean/density/blob/fdcf5dee376a379054148e3d44dfb03b6b987ce0/src/api.h#L89
Also, the method here:
https://github.com/centaurean/density/blob/fdcf5dee376a379054148e3d44dfb03b6b987ce0/src/api.h#L144
has been renamed and takes an extra algorithm parameter.
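A hedged sketch of what this enables on the decompression side; density_metadata and density_read_metadata below are illustrative placeholder names, not the actual identifiers at that commit (see the api.h links above for the real definitions):

/* Hypothetical names throughout: the point is that the original size
   now travels with the compressed data, so the output buffer can be
   sized exactly instead of guessed (as with the 65536 used earlier). */
density_metadata meta;                        /* placeholder type */
density_read_metadata(cbuf, clen, &meta);     /* placeholder call */

uint_fast64_t out_size = density_decompress_safe_size(meta.original_size);
uint8_t *dbuf = malloc(out_size);
density_decompress(cbuf, clen, dbuf, out_size);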
Other than that, it brings a speed bump and simplifies a case where dictionary initialization could be avoided. I don't think it will make much of a difference in regards to the above problem though.
The improved chameleon algorithm is starting to look good now. I'm soon going to start working on the two others.
In regards to the problematic calloc, do you have a public repository where your test code is hosted? I could try a test on my dev platform.
Apart from that, how does it work overall for you in regards to speed and ratio?
@gpnuma Thanks for the update.
I don't have a public repository for now; all my code is based on the LZ4 wrapper of Percona Server.
I will start a new branch of rsync next week to synchronize TBs of highly compressible logs between servers; I do think density will bring a meaningful performance boost in this application. I will let you know the progress and share the code.
Quoting @191919: