esoha-nvidia opened this issue 1 year ago
Hm, I'm not against this, but this kind of custom operation seems like what the device-side API is intended for.
I'd like to avoid duplicating the work that cuco has already done. So if I can use the bulk operations, all the better!
I totally get your point about avoiding writing a kernel that is already 90% existent in cuco.
It's really about API flexibility:
The solution you are proposing is somewhere in between: You get the benefit of the easy-to-use host API, plus some additional bells and whistles which otherwise only the device API could give you.
So, when a developer plans to use cuco for a particularly complex task, the thought process runs through three options: 1. use the plain bulk host API, 2. use the in-between hybrid you propose, 3. write a fully custom kernel with the device API.
I'm not convinced 2. is helpful when we can just move to 3. directly and be able to optimize the hell out of the custom kernel.
However, we could solve your particular problem with a different approach:
Let's say we provide a bulk API for insert_and_find, returning the tuples {slot_reference, is_new}. The output tuples will be written to some output iterator. Instead of naively writing the output to some memory location directly, you could employ one of Thrust's fancy iterators, namely thrust::transform_[input_]output_iterator, which takes your custom functor and applies it to the output pairs.
So you can achieve the same thing that you asked for by using the bulk API with Thrust fancy iterators.
@PointKernel @jrhemstad I have one minor concern about my proposal:
We use a shmem bounce buffer to stage the result tuples and then flush them to the output iterator using memcpy_async. Does this work with Thrust fancy iterators?
The output iterator is a nice idea. No way it works with the memcpy_async, right? Surely memcpy_async requires that the destination is a real pointer. Worse still, it might compile and work just fine on A100 and before but fail on Hopper.
Correction: We use the memcpy_async approach only in multimap. For static_map/static_set we store the output directly: https://github.com/NVIDIA/cuCollections/blob/359f5ae67e93b69a8df35ebd1d12f746aac8916e/include/cuco/detail/static_map/kernels.cuh#L121
So using a fancy iterator should work in this scenario.
Is your feature request related to a problem? Please describe.
I would like to be able to run a function on every key that is inserted by the bulk insert functions.
Describe the solution you'd like
Each insert would return the slot and whether the key was newly inserted or already present (in which case the insert was effectively a lookup). I could then do whatever I want with that information. For example, I could use an atomic counter to fill an array with unique elements, allowing me to perform a "retrieve_all" during the insert without reading the table twice.
Describe alternatives you've considered
I can copy and paste all the bulk insert code into my own code and use that. The problem is that I trust cuco to get the grid shape and everything else right, and if I do it myself I might get it wrong. Also, if cuco comes up with improvements to the bulk insert, I wouldn't automatically benefit from them.
Additional context
No response