RoaringBitmap / RoaringBitmap

A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others
http://roaringbitmap.org/
Apache License 2.0
3.5k stars 547 forks source link

MutableBitSliceIndex.rangeEq returns wrong results when predicate is greater than max value #549

Closed ashishkf closed 2 years ago

ashishkf commented 2 years ago

Describe the bug MutableBitSliceIndex.rangeEq returns incorrect results when predicate is greater than max value.

To Reproduce Create a MutableBitSliceIndex, add a few values and then invoke rangeEq with predicate greater than the max value that was added. It should return wrong results.

ashishkf commented 2 years ago

Basically rangeEq should return empty results by check if predicate is greater than eq to the max value.

richardstartin commented 2 years ago

I’m not sure there’s any value in keeping MutableBitSliceIndex now we have RangeBitmap, since RangeBitmap outperforms it for the same operations. RangeBitmap doesn’t explicitly support equality or inequality predicates but they would be easy to add. The former (eq(x)) can be emulated with between(x, x).

ashishkf commented 2 years ago

Does RangeBitmap support getValue(columnId/rowId) and setValue(columnId/rowId, value) APIs?

richardstartin commented 2 years ago

Not exactly, though getValue would be easy to implement. I appreciate you've reported a bug and this discussion is tangential, but what is your use case for these methods? It would be good to understand if it can be handled better by RangeBitmap since it's generally a lot faster than the bit slice index implementation.

ashishkf commented 2 years ago

Basically, the use case is same as what BitSlicedIndex's document talks about - storing values for high cardinality OLAP columns indexed by rowId.

If RoaringBitMap is more efficient and if it can address the above use case, I can try it out - including making changes if required. Can RoaringBitMap be adapted to do setValue/getValue use with non-sequential rowId?

richardstartin commented 2 years ago

RangeBitmap has a build then freeze lifecycle, which should usually be appropriate and I wouldn't want to see that change. Do you really need online mutations, if so fixing MutableBitSliceIndex would be a better course of action.

richardstartin commented 2 years ago

I also don't understand the use case for getValue - when would something other than a bitmap (or an iterator) of row ids (or column ids if you are so inclined) satisfying a predicate not be desirable?

ashishkf commented 2 years ago

mutations, if so fixing MutableBitSliceIndex

Yes, we do need online mutations.

ashishkf commented 2 years ago

I also don't understand the use case for getValue - when would something other than a bitmap (or an iterator) of row ids (or column ids if you are so inclined) satisfying a predicate not be desirable?

predicate / rangeEq returning bitmap of row ids is good. However, we are using MutableBitSliceIndex to store high cardinality values for a given column and row by having one MutableBitSliceIndex per column and indexing the value by the rowid.

lemire commented 2 years ago

I have contributed a potential fix at https://github.com/RoaringBitmap/RoaringBitmap/pull/559

Smallhi commented 2 years ago

@richardstartin As you said, "RangeBitmap has a build then freeze lifecycle". which I think narrow the scope use of bsi ,especially in User Behavior Analysis.

If one app like Tiktok or Youtube which has more than 1B users visiting per day. And if we have user visiting logs like this.

TABLE : user_action_log user_id video_id video_type watch_duartion watch_times date_str
user_1 1 sport 20 1 20220505
user_2 1 sport 25 2 20220505
user_3 2 news 30 2 20220505

Most of times, we will do User Behavior Analysis like this.

    SELECT date_str
                  ,video_type
                 ,SUM(watch_duartion)/COUNT(DISTINCT user_id) as avg_watch_duration
                 ,SUM(watch_times)/COUNT(DISTINCT user_id) as avg_watch_times
    FROM   user_action_log
    WHERE  video_id =1
   GROUP BY  date_str,video_type

AS we konwn, if each user has 10 visit records per day. then this table will has more than 10B rows. which mean the query is expensive.

if we use MutableBitSliceIndex to compress this table , like this

video_id video_type date_str watch_times_bsi watch_duartion_bsi
1 sport 20220505 bsi{user1->1,user_2->2} bsi {user1->20,user_2->25}
2 news 20220505 bsi{user_3->2} bsi{user3->30}

Then, the demo query will become like this

    SELECT date_str
                  ,video_type
                 ,bsi_sum_avg(watch_times_bsi) as avg_watch_duration
                 ,bsi_sum_avg(watch_duartion_bsi) as avg_watch_times
    FROM   user_action_log
    WHERE  video_id =1
    GROUP BY  date_str,video_type

And even , If we have a crowd A: [user1,user3], we want to know the details visit log of them. we can query like this (Supported we use roaringbtimap to store crowd A)

     SELECT video_id,video_type,user_id,watch_times
     FROM   user_action_log
    WHERE  video_id =1  and data = '20220505'
      lateral view bsi_explore(watch_times_bsi,A) as(user_id,watch_times) 

This method is from Kylin, which precompute cube (multidimensional index) to accelerate OLAP query. And As you know, If we want to do User Behavior Analysis for app which has more than 1B uses per day. OLAP using mpp architecture is expensive. Precomputing technology might be better.

Kylin use Roaringbitmap to store Mesure (refre to Kylin BitmpCounter), like this: video_id video_type date_str watch_times_user watch_duartion_user
1 sport 20220505 roarigbitmap[user1,user2] roarigbitmap[user1,user2]
2 news 20220505 roarigbitmap[user3] roarigbitmap[user3]

Obviousely, this structure lose the user visiting detail. It can only support MOLAP rollup, but not drill down. And that's why I implemented Bit-sliced-index.

lemire commented 2 years ago

In any case, please keep reporting the issues you find. If you can prepare reproducible test cases, it is even better.

Smallhi commented 2 years ago

@lemire I’d like to do. But these days, I’m focusing on implementing BSI/Rangbitmap for CRoring and developing extensions of Postgresql using them…

lemire commented 2 years ago

@Smallhi Sounds great.

patelprateek commented 10 months ago

@richardstartin : was following this thread , could you please elaborate on rangebitmap predicate restrictions a bit since i am new to this field . Also any pointers on how to enable such predicates this is in regards to the statement above : RangeBitmap doesn’t explicitly support equality or inequality predicates

richardstartin commented 10 months ago

@patelprateek it has supported both of those predicates since #606

patelprateek commented 10 months ago

@richardstartin : thanks for the quick response , curios if this has been implemented in the cpp version as well here : https://github.com/RoaringBitmap/CRoaring

richardstartin commented 10 months ago

No this is not in CRoaring though I would be open to porting it. Could you outline your use case?

patelprateek commented 10 months ago

@richardstartin : In some of our production workloads we want to build additional bitmap based index for some quick filtering and generating a bitmap of eligible document ids. Typically for categorical attributes roaringbitmap is fine but for numerical attributes with high cardinality need to support some range based filtering i.e some_threshold > x . The bitmaps are passed down to some other indexing structure where they are used for doing quick filtering while scanning , these consumers are typically in cpp language and the scale of documents in index can range from 100k to 10M , with latency requirements < 20-30 ms at P90 . Hence we are restricted to build this bitmap indexes natively in the same service , so that we dont have to pay extra cost on translation of ids (bitmaps needs to be build on same ids as the consumers) , serializing and network overheads or any jni/pythong bindings copies.
Also is it possible to use these bitmaps on floating point values . Currently our use are build some index snapshot and then serving does read only queries . In future we might need incremental builds , deletion , and updates support

richardstartin commented 10 months ago

It's possible to port the implementation to C++, and the serialisation format should be portable so it should support scenarios where one language is used to build the bitmap and another is used to consume it. I may get around to doing that within the next fortnight, if someone else wants to do it it might be a faster turnaround. 20-30ms for 10M is easily achievable in Java, the filtering at that row count should take microseconds.

Also is it possible to use these bitmaps on floating point values . Currently our use are build some index snapshot and then serving does read only queries . In future we might need incremental builds , deletion , and updates support

In Pinot (which is the only place this class is used and was the sole motivation for use within range indexes) floating point values are encoded with the order-preserving transformation here. I doubt Pinot was the first or only system to use this kind of order preserving encoding.

patelprateek commented 10 months ago

@richardstartin : good to know that even in java these are 20-30 ms. Yes most of our index building logic could come froma. different language. On the serving side due to high QPS and low latency requirement we have a cpp runtime for filter execution . Another question : we have queries like country = us and age > 30 and distance < 0.4 , it is possible to take unions and intersection between a normal bitmap and range bitmap . I read one of your blogs regading range bitmap example and found that it perfectly fits our use case.

I will take a look at the float order-preserving transformation as well. Thanks for all your help , really appreciate the response