NVIDIA / spark-rapids-jni

RAPIDS Accelerator JNI For Apache Spark
Apache License 2.0
36 stars 64 forks source link

[FEA] Generating boolean column vector with roaring bitmap. #2113

Open liurenjie1024 opened 3 months ago

liurenjie1024 commented 3 months ago

Is your feature request related to a problem? Please describe. Deletion vector is a widely used technique used for efficiently filtering data in columnar storage.

Describe the solution you'd like

We need to implement a gpu kernel to generate a boolean vector efficiently, following java code demo the logic in cpu:

public static boolean[] filter(Roaring64Bitmap bitmap, int start, int end) {
  int len = end - start;
  boolean[] ret = new boolean[len];
  for(int i=0; i<len; i++) {
    ret[i] = bitmap.contains(start + i);
  }

  return ret;
}

Describe alternatives you've considered

Currently we use roaring bitmap due to it's efficient for compression and bitmap operations. We can use other gpu optimized bitmaps instead if someone could find one which is more gpu friendly.

liurenjie1024 commented 3 months ago

This is related to https://github.com/NVIDIA/spark-rapids/issues/10904

ttnghia commented 3 months ago

I don't think we have any GPU implementation for roaring bitmap yet.

pmixer commented 2 months ago

I have the experimental code porting parts of RoaringBitmap to CUDA, not a perfect fit for the need(only support int32 keys instead of int64 ones, the later is more challenging) and not open-sourced yet(still some perf optimization work in progress).

Just want to join in to continue the discussion.

Assuming we have it already, like in below:

public static boolean[] filter(CuRoaring64Bitmap gpu_bmp, int start, int end) {
  boolean[] ret = new boolean[len];
  gpu_bmp.contains_range(start, end, ret); // better to be a batched function call for efficiency
  return ret;
}

there're still some questions -

is 64 bits a hard requirement or 32 bits would also be fine sometimes(by remapping etc.)?

where would be gpu_bmp created and how would it be maintained?

any idea?

liurenjie1024 commented 2 months ago

gpu_bmp will be created by other cpu algorithm, and we can pass it to gpu with serialized format.

pmixer commented 2 months ago

gpu_bmp will be created by other cpu algorithm, and we can pass it to gpu with serialized format.

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

pls let me check whether https://github.com/NVIDIA/cuCollections could provide the required functionality or not in next few days.

liurenjie1024 commented 2 months ago

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

Currently we use roaring bitmap because it's highly efficient for compression and OR operation, which is quite important for previouse operation.

pmixer commented 2 months ago

if it just serves as a on gpu bitmap based filter, seems sth worth to try by minimal engineering efforts.

Currently we use roaring bitmap because it's highly efficient for compression and OR operation, which is quite important for previouse operation.

RoaringBitmap is popular and highly efficient, well, it's 64bits version seems could be further improved according to my limited knowledge https://mp.weixin.qq.com/s/tJQoNRZ5UDJ_IASZLlhB4Q.

Anyway, it does not matter that much as only a subset of roaring features required for this use case.

@liurenjie1024 could u pls elaborate on expected set size and indices distribution(I mean the raw integers feed into the set), and to be filtered range length(end - start) for the target application?

@mattahrens @ttnghia @winningsix @firestarman @sameerz FYI

liurenjie1024 commented 2 months ago

Hi, @pmixer

could u pls elaborate on expected set size and indices distribution(I mean the raw integers feed into the set), and to be filtered range length(end - start) for the target application?

The expected set size depends on actual customer's file size. Simple bitset maybe not practical since 8 million rows requires 1M size, and that's only for one file. We may have thousands of files.

pmixer commented 1 month ago

Well, thousands of files means few GBs mem needed, may be a waste but possible to have a try before further optimization.

Just FYI, currently there are https://github.com/NVIDIA/cuCollections/tree/dev/include/cuco/detail/trie/dynamic_bitset in cuCollection and https://nvidia.github.io/cccl/thrust/api/function_group__set__operations_1gacf5edd558fd96eee01b6425b3909c6b4.html in Thrust open-sourced for on-gpu set operations.