Open toien opened 2 years ago
Do you get good results with gzip?
I have no experience compressing roaring bitmaps with generic codecs... assuredly, the impact will be data specific...
Related:
Compressing JSON: gzip vs zstd https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/
Thanks for the fast reply!
I wrote a test (Java/Golang): populate a RoaringBitmap with random uint32 values, serialize it, and also gzip-compress it, but it turns out the data is almost not compressed at all.
```java
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

import org.roaringbitmap.RoaringBitmap;

// populate with random data
Random r = new Random();
RoaringBitmap rbm = new RoaringBitmap();
for (int i = 0; i < size; i++) {
    long rValue = r.nextLong() & 0xffffffffL; // random unsigned 32-bit value
    rbm.add((int) rValue);
}

// dump to disk
ByteBuffer buffer = ByteBuffer.allocate(rbm.serializedSizeInBytes());
rbm.serialize(buffer);
Files.write(Paths.get(filepath), buffer.array());

// compress and dump (FileOutputStream creates/truncates the file)
Path cpath = Paths.get(compressedFilepath);
try (GZIPOutputStream gos = new GZIPOutputStream(new FileOutputStream(cpath.toFile()))) {
    gos.write(buffer.array());
}
```
Here is the result:

```
> ls -alht
-rw-r--r-- 1 worker staff 19M Mar 21 15:59 random-1000w.bin.gz
-rw-r--r-- 1 worker staff 20M Mar 21 15:59 random-1000w.bin
```
I am trying zstd.
It seems that generic compression is not a good fit for roaring bitmaps:

```
INFO: lz4 decompressed len: 20501162, compressed len: 20574769
INFO: zstd decompressed len: 20501162, compressed len: 20415315
```
Interestingly, it looks like lz4 makes things worse in your test!
RoaringFormatSpec : specification of the compressed-bitmap Roaring formats
Roaring bitmaps are already a form of compression, so the entropy of the serialized data should already be rather close to the maximum (you can make an entropy graph, for example using binwalk), and compressing it once more won't yield a significant result.
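A quick way to see this effect without RoaringBitmap at all: a minimal sketch (the class name and buffer sizes are my own choices) that runs `java.util.zip.Deflater` — the same DEFLATE algorithm gzip uses — over high-entropy random bytes versus highly repetitive bytes.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class EntropyDemo {
    // Compress the input with DEFLATE and return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // Worst case DEFLATE output slightly expands the input, so leave headroom.
        byte[] out = new byte[input.length * 2 + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out, 0, out.length);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] random = new byte[1 << 20];
        new Random(42).nextBytes(random); // high entropy: near-incompressible
        byte[] zeros = new byte[1 << 20]; // low entropy: shrinks dramatically

        System.out.println("random: " + deflatedSize(random) + " bytes");
        System.out.println("zeros:  " + deflatedSize(zeros) + " bytes");
    }
}
```

The random megabyte stays about a megabyte after compression, while the all-zeros megabyte collapses to a tiny fraction of its size — the same reason the already-dense serialized bitmap above barely shrinks.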
Suppose I have multiple clients implemented in different languages (Go, Java, ...), and all of them need to download a roaring bitmap from a server over HTTP.
When the bitmap gets large (about 10 million entries) and the serialized binary data comes out around 20 MB, I think compressing it before sending may save a lot of transmission time.
I am trying gzip. Any suggestions about compressing?
Thanks in advance!
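For the HTTP-transmission part specifically, one common approach is to gzip the response stream and advertise it with `Content-Encoding: gzip`, so each client can use its language's standard decompression. Below is a minimal sketch using the JDK's built-in `com.sun.net.httpserver`; the `/bitmap` path is made up, and the `payload` parameter stands in for the serialized bitmap bytes (in the real setup it would come from `rbm.serialize(...)` as in the snippet above). This is an illustration of the wiring, not a recommended production server.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.zip.GZIPOutputStream;

public class GzipBitmapServer {
    // Starts a server on an ephemeral port that serves `payload` gzip-compressed.
    static HttpServer start(byte[] payload) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/bitmap", exchange -> {
            // Tell clients the body is gzip-compressed so they can
            // decompress it with their standard library.
            exchange.getResponseHeaders().set("Content-Encoding", "gzip");
            exchange.sendResponseHeaders(200, 0); // 0 = chunked/unknown length
            try (OutputStream os = new GZIPOutputStream(exchange.getResponseBody())) {
                os.write(payload);
            }
        });
        server.start();
        return server;
    }
}
```

As the thread notes, though, gzip buys little on a near-random bitmap like the one in the test; this mainly helps when the bitmap's contents are clustered enough to compress in the first place.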