RoaringBitmap / RoaringFormatSpec

Specification of the compressed-bitmap Roaring format
http://roaringbitmap.org/
Apache License 2.0
149 stars 14 forks source link

[question] any suggestion about compressing serialized data? #10

Open toien opened 2 years ago

toien commented 2 years ago

suppose i have multiple clients implemented in different languages(go, java...), and all of them need to download roaringbitmap from server through http.

when bitmap gets large(about 10 millions) and serialized binary data comes out about 20 MB, i think compress before sending may save a lot of transmission time.

i am trying use gzip. any suggestions about compressing?

thanks in advance

lemire commented 2 years ago

Do you get good results with gzip?

I have no experience compressing roaring bitmaps with generic codecs... assuredly, the impact will be data specific...

Related:

Compressing JSON: gzip vs zstd https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/

toien commented 2 years ago

thanks for fast reply!

i write a test (java/golang), populate roaringbitmap with random uint32 and serialize it, also using gzip compress it, but it tunrns out data almost not compressed.

// populate with random data
Random r = new Random();
RoaringBitmap rbm = new RoaringBitmap();

for (int i = 0; i < size; i++) {
  long rValue = r.nextLong() & 0xffffffffL;
  int casted = (int) rValue;
  rbm.add(casted);
}

// dump to disk
ByteBuffer buffer = ByteBuffer.allocate(rbm.serializedSizeInBytes());
rbm.serialize(buffer);

Path path = Paths.get(filepath);

Files.write(path, buffer.array());

// compress and dump
Path cpath = Paths.get(compressedFilepath);

Files.write(cpath, new byte[0], StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);

try (GZIPOutputStream gos = new GZIPOutputStream(new FileOutputStream(cpath.toFile()))) {
  gos.write(buffer.array());
}

here is result:

> ls -alht
-rw-r--r--   1 worker  staff    19M Mar 21 15:59 random-1000w.bin.gz
-rw-r--r--   1 worker  staff    20M Mar 21 15:59 random-1000w.bin

i am trying zstd

toien commented 2 years ago

it seems that generic compress not fit for roaringbitmap

INFO: lz4 decompressed len: 20501162, compressed len:20574769
INFO: zstd decompressed len: 20501162, compressed len:20415315
lemire commented 2 years ago

Interestingly, it looks like lz4 makes things worse in your test!

derlaft commented 1 year ago

RoaringFormatSpec : specification of the compressed-bitmap Roaring formats

roaring bitmaps are already a type of compression. therefore the entropy of the serialized data should already be rather close to the maximum (you can make an entropy graph for example using binwalk) and compressing it once more won't yield a significant result