Closed ashishkf closed 2 years ago
Basically rangeEq should return empty results by check if predicate is greater than eq to the max value.
I’m not sure there’s any value in keeping MutableBitSliceIndex now we have RangeBitmap, since RangeBitmap outperforms it for the same operations. RangeBitmap doesn’t explicitly support equality or inequality predicates but they would be easy to add. The former (eq(x)) can be emulated with between(x, x).
Does RangeBitmap support getValue(columnId/rowId) and setValue(columnId/rowId, value) APIs?
Not exactly, though getValue would be easy to implement. I appreciate you've reported a bug and this discussion is tangential, but what is your use case for these methods? It would be good to understand if it can be handled better by RangeBitmap since it's generally a lot faster than the bit slice index implementation.
Basically, the use case is same as what BitSlicedIndex's document talks about - storing values for high cardinality OLAP columns indexed by rowId.
If RoaringBitMap is more efficient and if it can address the above use case, I can try it out - including making changes if required. Can RoaringBitMap be adapted to do setValue/getValue use with non-sequential rowId?
RangeBitmap has a build then freeze lifecycle, which should usually be appropriate and I wouldn't want to see that change. Do you really need online mutations, if so fixing MutableBitSliceIndex would be a better course of action.
I also don't understand the use case for getValue
- when would something other than a bitmap (or an iterator) of row ids (or column ids if you are so inclined) satisfying a predicate not be desirable?
mutations, if so fixing MutableBitSliceIndex
Yes, we do need online mutations.
I also don't understand the use case for
getValue
- when would something other than a bitmap (or an iterator) of row ids (or column ids if you are so inclined) satisfying a predicate not be desirable?
predicate / rangeEq returning bitmap of row ids is good. However, we are using MutableBitSliceIndex to store high cardinality values for a given column and row by having one MutableBitSliceIndex per column and indexing the value by the rowid.
I have contributed a potential fix at https://github.com/RoaringBitmap/RoaringBitmap/pull/559
@richardstartin As you said, "RangeBitmap has a build then freeze lifecycle". which I think narrow the scope use of bsi ,especially in User Behavior Analysis.
If one app like Tiktok or Youtube which has more than 1B users visiting per day. And if we have user visiting logs like this.
TABLE : user_action_log | user_id | video_id | video_type | watch_duartion | watch_times | date_str |
---|---|---|---|---|---|---|
user_1 | 1 | sport | 20 | 1 | 20220505 | |
user_2 | 1 | sport | 25 | 2 | 20220505 | |
user_3 | 2 | news | 30 | 2 | 20220505 |
Most of times, we will do User Behavior Analysis like this.
SELECT date_str
,video_type
,SUM(watch_duartion)/COUNT(DISTINCT user_id) as avg_watch_duration
,SUM(watch_times)/COUNT(DISTINCT user_id) as avg_watch_times
FROM user_action_log
WHERE video_id =1
GROUP BY date_str,video_type
AS we konwn, if each user has 10 visit records per day. then this table will has more than 10B rows. which mean the query is expensive.
if we use MutableBitSliceIndex to compress this table , like this
video_id | video_type | date_str | watch_times_bsi | watch_duartion_bsi |
---|---|---|---|---|
1 | sport | 20220505 | bsi{user1->1,user_2->2} | bsi {user1->20,user_2->25} |
2 | news | 20220505 | bsi{user_3->2} | bsi{user3->30} |
Then, the demo query will become like this
SELECT date_str
,video_type
,bsi_sum_avg(watch_times_bsi) as avg_watch_duration
,bsi_sum_avg(watch_duartion_bsi) as avg_watch_times
FROM user_action_log
WHERE video_id =1
GROUP BY date_str,video_type
And even , If we have a crowd A: [user1,user3], we want to know the details visit log of them. we can query like this (Supported we use roaringbtimap to store crowd A)
SELECT video_id,video_type,user_id,watch_times
FROM user_action_log
WHERE video_id =1 and data = '20220505'
lateral view bsi_explore(watch_times_bsi,A) as(user_id,watch_times)
This method is from Kylin, which precompute cube (multidimensional index) to accelerate OLAP query. And As you know, If we want to do User Behavior Analysis for app which has more than 1B uses per day. OLAP using mpp architecture is expensive. Precomputing technology might be better.
Kylin use Roaringbitmap to store Mesure (refre to Kylin BitmpCounter), like this: | video_id | video_type | date_str | watch_times_user | watch_duartion_user |
---|---|---|---|---|---|
1 | sport | 20220505 | roarigbitmap[user1,user2] | roarigbitmap[user1,user2] | |
2 | news | 20220505 | roarigbitmap[user3] | roarigbitmap[user3] |
Obviousely, this structure lose the user visiting detail. It can only support MOLAP rollup, but not drill down. And that's why I implemented Bit-sliced-index.
In any case, please keep reporting the issues you find. If you can prepare reproducible test cases, it is even better.
@lemire I’d like to do. But these days, I’m focusing on implementing BSI/Rangbitmap for CRoring and developing extensions of Postgresql using them…
@Smallhi Sounds great.
@richardstartin : was following this thread , could you please elaborate on rangebitmap predicate restrictions a bit since i am new to this field . Also any pointers on how to enable such predicates
this is in regards to the statement above : RangeBitmap doesn’t explicitly support equality or inequality predicates
@patelprateek it has supported both of those predicates since #606
@richardstartin : thanks for the quick response , curios if this has been implemented in the cpp version as well here : https://github.com/RoaringBitmap/CRoaring
No this is not in CRoaring though I would be open to porting it. Could you outline your use case?
@richardstartin : In some of our production workloads we want to build additional bitmap based index for some quick filtering and generating a bitmap of eligible document ids. Typically for categorical attributes roaringbitmap is fine but for numerical attributes with high cardinality need to support some range based filtering i.e some_threshold > x .
The bitmaps are passed down to some other indexing structure where they are used for doing quick filtering while scanning , these consumers are typically in cpp language and the scale of documents in index can range from 100k to 10M , with latency requirements < 20-30 ms at P90 . Hence we are restricted to build this bitmap indexes natively in the same service , so that we dont have to pay extra cost on translation of ids (bitmaps needs to be build on same ids as the consumers) , serializing and network overheads or any jni/pythong bindings copies.
Also is it possible to use these bitmaps on floating point values . Currently our use are build some index snapshot and then serving does read only queries . In future we might need incremental builds , deletion , and updates support
It's possible to port the implementation to C++, and the serialisation format should be portable so it should support scenarios where one language is used to build the bitmap and another is used to consume it. I may get around to doing that within the next fortnight, if someone else wants to do it it might be a faster turnaround. 20-30ms for 10M is easily achievable in Java, the filtering at that row count should take microseconds.
Also is it possible to use these bitmaps on floating point values . Currently our use are build some index snapshot and then serving does read only queries . In future we might need incremental builds , deletion , and updates support
In Pinot (which is the only place this class is used and was the sole motivation for use within range indexes) floating point values are encoded with the order-preserving transformation here. I doubt Pinot was the first or only system to use this kind of order preserving encoding.
@richardstartin : good to know that even in java these are 20-30 ms. Yes most of our index building logic could come froma. different language. On the serving side due to high QPS and low latency requirement we have a cpp runtime for filter execution . Another question : we have queries like country = us and age > 30 and distance < 0.4 , it is possible to take unions and intersection between a normal bitmap and range bitmap . I read one of your blogs regading range bitmap example and found that it perfectly fits our use case.
I will take a look at the float order-preserving transformation as well. Thanks for all your help , really appreciate the response
Describe the bug MutableBitSliceIndex.rangeEq returns incorrect results when predicate is greater than max value.
To Reproduce Create a MutableBitSliceIndex, add a few values and then invoke rangeEq with predicate greater than the max value that was added. It should return wrong results.