@191919 how was 1.buf compressed exactly? Was it with the same density version (the one in the dev branch)? Do you have the original (i.e. uncompressed) data so I can try a roundtrip?
Please uncompress the attached gz and check the file density_01263.bin with the following test:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "density/density_api.h"

/* Compresses the first 32768 bytes of density_01263.bin into 1.buf
   and returns the compressed size. */
int cc() {
    uint8_t *dbuf = malloc(100000);
    uint8_t *cbuf = malloc(100000);
    FILE *fp = fopen("density_01263.bin", "rb");
    fread(dbuf, 1, 32768, fp);
    fclose(fp);
    density_processing_result r;
    r = density_compress(dbuf, 32768, cbuf, density_compress_safe_size(32768),
                         DENSITY_ALGORITHM_CHAMELEON);
    printf("compressed size=%d\n", (int) r.bytesWritten);
    fp = fopen("1.buf", "wb");
    fwrite(cbuf, 1, r.bytesWritten, fp);
    fclose(fp);
    free(dbuf);
    free(cbuf);
    return r.bytesWritten;
}

/* Reads the compressed data back from 1.buf and decompresses it. */
void dd(int clen) {
    uint8_t *dbuf = malloc(100000);
    uint8_t *cbuf = malloc(100000);
    FILE *fp = fopen("1.buf", "rb");
    fread(cbuf, 1, clen, fp);
    fclose(fp);
    density_decompress(cbuf, clen, dbuf, density_decompress_safe_size(65536));
    free(dbuf);
    free(cbuf);
}

int main() {
    dd(cc());
    return 0;
}
The output looked like this on my machine (not the one on which I wrote the previous post, but the reason for the crash is the same):
$ cc -g -o 1 1.c density/*.c
$ lldb ./1
(lldb) target create "./1"
Current executable set to './1' (x86_64).
(lldb) r
Process 42001 launched: './1' (x86_64)
compressed size=22520
32768 22520 -8
Process 42001 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1007fffe0)
frame #0: 0x00007fff7383e193 libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 627
libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell:
-> 0x7fff7383e193 <+627>: vmovaps ymm2, ymmword ptr [rsi - 0x40]
0x7fff7383e198 <+632>: sub rsi, 0x40
0x7fff7383e19c <+636>: sub rdx, 0x40
0x7fff7383e1a0 <+640>: ja 0x7fff7383e180 ; <+608>
Target 0: (1) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x1007fffe0)
* frame #0: 0x00007fff7383e193 libsystem_platform.dylib`_platform_memmove$VARIANT$Haswell + 627
frame #1: 0x0000000100014041 1`density_chameleon_decode(state=0x00007ffeefbff7c8, in=0x00007ffeefbff7e8, in_size=22512, out=0x00007ffeefbff7e0, out_size=65792) at chameleon_decode.c:209
frame #2: 0x00000001000012f0 1`density_decompress_with_context(input_buffer="", input_size=22512, output_buffer="", output_size=65792, context=0x0000000100400000) at buffer.c:180
frame #3: 0x00000001000015df 1`density_decompress(input_buffer="", input_size=22520, output_buffer="", output_size=65792) at buffer.c:207
frame #4: 0x000000010000082b 1`dd(clen=22520) at 1.c:34
frame #5: 0x000000010000086b 1`main at 1.c:40
frame #6: 0x00007fff735b9115 libdyld.dylib`start + 1
frame #7: 0x00007fff735b9115 libdyld.dylib`start + 1
Great! I'm able to reproduce now. Thank you!
@191919 it now works here with the latest dev branch:
build/benchmark -1 density_01263.bin
Single threaded in-memory benchmark powered by Density 0.15.0
Copyright (C) 2015 Guillaume Voirin
Built for MacOS (Little endian system, 64 bits) using Clang 9.1.0, Apr 3 2018 17:25:54
Allocated 80,448 bytes of in-memory work space
Chameleon algorithm
===================
Using file density_01263.bin copied in memory
Uncompressed and round-trip data hashes match. Starting main benchmark.
Round-tripping 32,768 bytes to 15,562 bytes (compression ratio 47.49% or 2.106x) and back
Compress speed 1485 MB/s (min 101 MB/s, max 2341 MB/s, best 0.0000s) <=> Decompress speed 1849 MB/s (min 135 MB/s, max 2731 MB/s, best 0.0000s)
Run time 10.000s (251319 iterations)
Released allocated memory.
@gpnuma Yes, it doesn't crash.
The top CPU consumer is still calloc, which eats up most of the time.
I have put the modified version on a proof-of-consistency cluster which receives data blobs, compresses them to local storage files on one server, and verifies by uncompressing and reading from another server. The test failed several times (it never has when using LZ4, even after over 550TB).
I will try to add more code to find out whether there is a pattern in the failing blocks.
@191919 thanks, yes I can understand that calloc is still using a lot of CPU time: when reaching the big dictionary stage, clearing is still needed, but it is now progressive and fully integrated in the algorithm.
But I have an idea to speed this up: I might allow the use of a read-only dictionary (i.e. one that does not learn during compression or decompression), since the datasets can be quite small and probably have a lot of similarities. It would allow for maximum speed, as you would instantiate the dictionary only once and then use it with all processes at the same time, for compressing and decompressing.
To check if this could be feasible and efficient, your datasets would have to be consistent (i.e. there should not be text mixed with binary data in the same dataset, for example). Do you think that is the case?
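To illustrate the read-only dictionary idea, here is a hypothetical sketch from the caller's side. The *_with_context calls mirror what is visible in the backtrace above, but the prepare-context signature and the "frozen dictionary" semantics are assumptions about a feature that does not exist yet:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include "density/density_api.h"

/* Hypothetical: a single context holding a read-only dictionary,
   instantiated once and then shared by every worker. */
static density_context *shared_context;

void init_shared_dictionary(void) {
    /* Assumed signature, modelled on the dev branch's context API. */
    density_processing_result prep = density_compress_prepare_context(
            DENSITY_ALGORITHM_CHAMELEON, false, malloc);
    shared_context = prep.context;  /* initialize once, then freeze */
}

uint_fast64_t compress_block(const uint8_t *in, uint_fast64_t in_size,
                             uint8_t *out, uint_fast64_t out_size) {
    /* With a dictionary that never learns, nothing is written back to
       shared_context, so concurrent use by many processes would be safe. */
    density_processing_result r = density_compress_with_context(
            in, in_size, out, out_size, shared_context);
    return r.bytesWritten;
}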
BTW I have just pushed a stability update; hopefully the test inconsistencies are now sorted.
@gpnuma My dataset is mixed with text and binary data.
I will update to your latest source to run the test again.
> My dataset is mixed with text and binary data.
In that case a read-only dictionary probably won't work efficiently (although it could be worth a test, if you have a big enough sample dataset); data needs to have at least some consistency for compression ratios to be interesting.
> The top CPU consumer is still calloc which eats up most of the time.
Now that I think of it, this is really strange, because with the latest chameleon algorithm implementation the only memset used is here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L102
And this zeroes only 32 bytes! The slowdown can't come from there. Even the next line,
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L103
clears only 256*8 = 2048 bytes, which cannot possibly be the bottleneck I think, even if the compiler interprets it as a calloc.
What happens now is that the actual total dictionary size is malloced initially, but only the necessary parts (very small) are zeroed, so as I said this calloc bottleneck is very weird to me on second thought; even further zeroing - if deemed necessary by the algorithm's decision engine - shouldn't have any significant CPU impact, and takes place here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/algorithms/chameleon/core/chameleon_encode.h#L206
for the bitmasks (8192 bytes only in the worst case), and here:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/algorithms/chameleon/core/chameleon_encode.h#L61
which anyway cannot be interpreted by the compiler as a calloc, as it involves a multiplication with a bitmask value.
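As an aside on how a calloc can show up in a profile without being named in the source: gcc at -O2 typically folds a malloc immediately followed by a full-size memset into a single calloc call. A minimal illustration (not Density source):

#include <stdlib.h>
#include <string.h>

/* Built with gcc -O2, the malloc + memset pair below is usually
   folded into one calloc(1, size) call, so calloc appears in the
   profile even though this source never mentions it. */
void *allocate_zeroed(size_t size) {
    void *buffer = malloc(size);
    if (buffer != NULL)
        memset(buffer, 0, size);
    return buffer;
}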
@gpnuma I reran the Instruments utility.
In density_compress:
+0x1c movq %r8, 8(%rsp)
+0x21 movq %rsi, 16(%rsp)
+0x26 movq %rdx, 24(%rsp)
+0x2b callq "DYLD-STUB$$malloc"
+0x30 movl %ebx, %edi
+0x32 movl %ebx, (%rax)
+0x34 movq %rax, %r13
+0x37 callq "density_get_dictionary_size"
+0x3c movl $1, %esi
+0x41 movb $0, 4(%r13)
+0x46 movq %rax, %rdi
+0x49 movq %rax, 8(%r13)
+0x4d callq "DYLD-STUB$$calloc"
+0x52 movl (%r13), %esi
+0x56 movq %rax, 16(%r13)
+0x5a movq 8(%rsp), %r8
+0x5f cmpl $1, %esi
+0x62 je "density_compress+0xb0"
+0x64 xorl %ebp, %ebp
+0x66 xorl %ebx, %ebx
+0x68 cmpq $7, %r8
+0x6c movl $2, %r12d
+0x72 ja "density_compress+0x118"
You can see the calloc call at offset +0x4d. It is what ate 94% of the CPU time.
Some good news: after pulling your latest changes, my test cluster has processed over 30TB of data without any inconsistency.
@191919 wow, that's strange, because it seems to be in the initialization phase, where only 32 + 2048 bytes are zeroed. I can't imagine that taking 94% of CPU time; something else must be going on.
I honestly can't say more, as when I run instrumentation here the context allocation does not even take 1% of CPU time, even with smallish datasets (a few kB).
Just as an error check, are you 100% sure you're using DENSITY_ALGORITHM_CHAMELEON as the algorithm parameter? There's an if/else branch here that could lead to a much bigger, potentially performance-influencing calloc if that's not the case:
https://github.com/centaurean/density/blob/369bf19c8efff55116d30e891d4a4e992cc8d7db/src/buffers/buffer.c#L100
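One quick way to rule this out is to make the test harness fail loudly on any error path. A sketch for the cc() function above (DENSITY_STATE_OK follows the public API's result state enum; the dev branch may differ slightly):

/* Defensive variant of the compression call in cc() above: keep the
   algorithm explicit and abort if the result reports an error. */
density_processing_result r = density_compress(
        dbuf, 32768, cbuf, density_compress_safe_size(32768),
        DENSITY_ALGORITHM_CHAMELEON);  /* chameleon has the smallest dictionary */
if (r.state != DENSITY_STATE_OK) {
    fprintf(stderr, "compression failed, state=%d\n", (int) r.state);
    exit(1);
}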
@gpnuma It seems to be another gcc problem.
If I use gcc-7.3.0 to compile my program, calloc is the top CPU eater, as you can see in the screenshot.
If I compile with clang, the CPU load is relieved.
I will test it on more Linux servers with gcc-8.
@191919 today's update brings a small API change: you are now able to know, via the context, what the initial size of the compressed data was, so it's much easier to set up an output buffer size.
It is accessible via the metadata struct here:
https://github.com/centaurean/density/blob/fdcf5dee376a379054148e3d44dfb03b6b987ce0/src/api.h#L89
Also, the method here:
https://github.com/centaurean/density/blob/fdcf5dee376a379054148e3d44dfb03b6b987ce0/src/api.h#L144
has been renamed and takes an extra algorithm parameter.
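A hedged sketch of what this enables on the decompression side; density_metadata and density_read_metadata below are illustrative placeholder names, not the actual identifiers at that commit (see the api.h links above for the real definitions):

/* Hypothetical names throughout: the point is that the original size
   now travels with the compressed data, so the output buffer can be
   sized exactly instead of guessed (as with the 65536 used earlier). */
density_metadata meta;                        /* placeholder type */
density_read_metadata(cbuf, clen, &meta);     /* placeholder call */

uint_fast64_t out_size = density_decompress_safe_size(meta.original_size);
uint8_t *dbuf = malloc(out_size);
density_decompress(cbuf, clen, dbuf, out_size);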
Other than that, it brings a speed bump and simplifies a case where dictionary initialization could be avoided. I don't think it will make much of a difference in regards to the above problem though.
The improved chameleon algorithm is starting to look good now. I'm soon going to start working on the two others.
In regards to the problematic calloc, do you have a public repository where your test code is hosted? I could try a test on my dev platform.
Apart from that, how does it work overall for you in regards to speed and ratio?
@gpnuma Thanks for the update.
I don't have a public repository for now; all my code is based on the LZ4 wrapper of Percona Server.
I will start a new branch of rsync next week to synchronize TBs of highly compressible logs between servers; I do think density will bring a meaningful performance boost in this application. I will let you know the progress and share the code.
Quoting @191919: