NVIDIA / cuCollections

Apache License 2.0

[ENHANCEMENT]: Add a callback to the bulk_insert functions #376

Open esoha-nvidia opened 1 year ago

esoha-nvidia commented 1 year ago

Is your feature request related to a problem? Please describe.

I would like to be able to run a function on every key that is inserted by the bulk insert functions.

Describe the solution you'd like

struct MyFunctor {
  void operator()(slot_type *slot, bool is_unique) {
    ...
  }
};

my_set.insert(input.begin(), input.end(), MyFunctor());

Each insert would pass along the slot and whether the key was newly inserted or already present (in which case the operation was effectively just a lookup). I can then do whatever I want with that information. For example, I could use an atomic counter to fill an array with the unique elements, effectively performing a "retrieve_all" during the insert without reading the table a second time.

Describe alternatives you've considered

I can copy and paste all the bulk insert code into my own code and use that. The problem is that I trust cuco to get the grid shape and everything else right; if I do it myself, I might get it wrong. Also, whenever cuco improves its bulk insert, I would pick up those improvements automatically.

Additional context

No response

jrhemstad commented 1 year ago

Hm, I'm not against this, but this kind of custom operation seems like what the device-side API is intended for.

esoha-nvidia commented 1 year ago

I'd like to avoid duplicating the work that cuco has already done. So if I can use the bulk operations, all the better!

sleeepyjack commented 1 year ago

I totally get your point about avoiding writing a kernel that is already 90% implemented in cuco.

It's really about API flexibility:

The solution you are proposing is somewhere in between: You get the benefit of the easy-to-use host API, plus some additional bells and whistles which otherwise only the device API could give you.

So, when a developer plans to use cuco for a particular complex task, they have the following thought process:

  1. Does the host API suffice to solve my problem? -> No.
  2. Does a custom functor work? -> Maybe. Requires some investigation.
  3. Does the device API solve my problem? -> Most likely yes, but also takes some time to get it right.

I'm not convinced option 2 is helpful when we can just move to option 3 directly and be able to optimize the hell out of the custom kernel.

However, we could solve your particular problem with a different approach: Let's say we provide a bulk API for insert_and_find, returning the tuples {slot_reference, is_new}. The output tuples will be written to some output iterator. Instead of naively writing the output to some memory location directly, you could employ one of Thrust's fancy iterators, namely thrust::transform_[input_]output_iterator, which takes your custom functor and applies it to the output pairs.

So you can achieve the same thing that you asked for by using the bulk API with Thrust fancy iterators.

@PointKernel @jrhemstad I have one minor concern about my proposal: We use a shmem bounce buffer to stage the result tuples and then flush them to the output iterator using memcpy_async. Does this work with Thrust fancy iterators?

esoha-nvidia commented 1 year ago

The output iterator is a nice idea. But there's no way it works with the memcpy_async, right? Surely memcpy_async requires that the destination is a real pointer. Worse still, it might compile and appear to work fine on A100 and earlier but fail on Hopper.

sleeepyjack commented 1 year ago

Correction: We use the memcpy_async approach only in multimap. For static_map/static_set we store the output directly: https://github.com/NVIDIA/cuCollections/blob/359f5ae67e93b69a8df35ebd1d12f746aac8916e/include/cuco/detail/static_map/kernels.cuh#L121

So using a fancy iterator should work in this scenario.