RoaringBitmap / CRoaring

Roaring bitmaps in C (and C++), with SIMD (AVX2, AVX-512 and NEON) optimizations: used by Apache Doris, ClickHouse, and StarRocks
http://roaringbitmap.org/
Other
1.57k stars 269 forks source link

more doc about the frozen format support #352

Closed niyue closed 2 years ago

niyue commented 2 years ago

I am currently using croaring 0.3.4, and I saw 0.4.0 was released previously, with the support of frozen format added.

I checked this PR and issue https://github.com/RoaringBitmap/CRoaring/issues/200, but I am not sure when I should use this frozen format. I know this set is now frozen, but is there any other benefit because of this change?

For example, in my project, roaring bitmap is serialized and stored on disk, and we do not need to modify the roaring bitmap after serialization/persistence onto disk, and only need to deserialize the roaring bitmap from disk and read its content. Is this kind of use case suitable for using the frozen format feature and will it bring any advantage compared to the current usage (like less storage cost maybe?)

It will be great if we could add some documentation briefly explain when the frozen format could be used and what the advantage it will bring. Thanks so much.

lemire commented 2 years ago

I am not eager to document it publicly more than we already do because it is strictly for expert users. It is also system specific (e.g., usage under Windows will be quite different from the usage under Linux).

For example, in my project, roaring bitmap is serialized and stored on disk, and we do not need to modify the roaring bitmap after serialization/persistence onto disk, and only need to deserialize the roaring bitmap from disk and read its content.

It might be, yes, if you are relying on memory mapping.

The expectation here is that you are already familiar with memory-mapped data structures, their benefits and limitations. It has upsides and downsides, and it requires care to work properly. As I stated, how to use it well depends on your OS and its configuration.

By default, in C and C++, we do not use memory mapped data structure. It is a specialized use case.

lemire commented 2 years ago

If you go down this route, you inherit a lot of responsibilities. The code is not aware that the bitmap is frozen. It is your responsibility to deal with all the consequences.

It is strictly for expert users and generally discouraged.

niyue commented 2 years ago

@lemire thanks so much for the explanation.

Could you please give more details about the impact of frozen roaring bitmap if the program relying on memory mapping?

My program does rely on memory mapping, for the part of data about roaring bitmap, it only treats the serialized roaring bitmap as a binary stream and reads it as a whole instead of random accessing part of the memory mapped file. This is because I didn't find any documentation about memory mapped a roaring bitmap previously and is not aware of this kind of usage.

P.S.: I did some web search and found Java's Roaring Bitmap implementation has some documentation about memory mapped roaring bitmap and range bitmap (https://github.com/RoaringBitmap/RoaringBitmap#range-bitmaps), both are pretty interesting and useful features to me. But I don't find any documentation in croaring about them.

lemire commented 2 years ago

@niyue There is no frozen format in Java. It is just the standard format. And we have a full API dedicated to supporting immutable bitmaps in Java, which we do not have in C. Rather we have a contributed extension.

lemire commented 2 years ago

It is deliberately undocumented and hidden because it is too rough an implementation for me to support it. If you’d like to work on it, making it more robust and better documented, that would be great.

lemire commented 2 years ago

The Java component you refer to is used in production, it has been thoroughly tested and it is documented.

The frozen format in this library is an experimental feature. I am not even sure it is particularly sane. I would not use it in production myself in its current state.

I do not like that it relies on a distinct binary format. And there might be hidden traps.

Our tests are minimal.

niyue commented 2 years ago

@lemire Got it. Thanks so much for the detailed explanation. I will keep an eye on it and will do some more comparison with Java's implementation to see if I can help on anything. Thanks again.

lemire commented 2 years ago

What I’d like to see is a version of the Java approach.

The « need » for a distinct format in C is somewhat ugly. It is not fundamental in the least.

A better way would be to focus on a specific set of modern compilers and systems.

andreas commented 1 year ago

@lemire, can you expand on what you mean by "the Java approach"? What's the best way to contribute to making the frozen format (or an alternative approach) better?

lemire commented 1 year ago

@andreas In Java, we have a single format (the documented portable format) and we can memory-map it. See https://github.com/RoaringBitmap/RoaringBitmap In particular, the ImmutableRoaringBitmap class. There is no frozen format, nor is there any need for it. The frozen format that was introduced in CRoaring has to do with the fact that it is believed to be unsafe to load a misaligned value (e.g., load a 4-byte integer that is not aligned on a 4-byte boundary). But, of course, that hasn't be true in a long time. So we are padding the format to accommodate systems that virtually nobody is using anymore. In theory, a C compiler can blow up your house if you do an unaligned load, but compilers that people actually use don't do that.

andreas commented 1 year ago

@lemire, thanks for clarifying!

Do I understand correctly, that it should be possible to implement a function roaring_bitmap_t *roaring_bitmap_portable_deserialize_frozen(const char *buf), where buf contains a bitmap in the portable serialized format, and the returned bitmap then has the ROARING_FLAG_FROZEN flag set?

Do you think we can expect performance comparable to roaring_bitmap_frozen_view? For our use case, we see roaring_bitmap_frozen_view being 2x-10x faster than roaring_bitmap_deserialize through roaring-wasm. This is critical for our application, as we currently spend more time deserializing bitmaps than performing bitmap operations themselves.

lemire commented 1 year ago

Do I understand correctly, that it should be possible to implement a function roaring_bitmap_t roaring_bitmap_portable_deserialize_frozen(const char buf), where buf contains a bitmap in the portable serialized format, and the returned bitmap then has the ROARING_FLAG_FROZEN flag set?

Yes. We were doing the equivalent in Java before the frozen format was introduced in C. There is nothing magical about the frozen format.

To be clear, the motivation is there... it is clearly useful to be able to just map your bitmaps.

andreas commented 1 year ago

I've made an attempt at an implementation of roaring_bitmap_portable_deserialize_frozen here: https://github.com/RoaringBitmap/CRoaring/pull/421

lemire commented 1 year ago

@andreas Great news!!!

lemire commented 1 year ago

This issue remains open because we don't have good documentation yet. PR invited.