NVIDIA / cuCollections


[ENHANCEMENT]: Add hasher/comparator adaptor to better support mapping table use case #476

Open PointKernel opened 6 months ago

PointKernel commented 6 months ago

Is your feature request related to a problem? Please describe.

Many hash-based implementations rely on the mapping table method to handle large keys/values that are expensive to copy or move around: the hash table stores indices into the original data, and the hasher/key_equal dereference those indices. Today, users have to write the mapping table hasher/key_equal adaptors themselves every time; they shouldn't have to.

Describe the solution you'd like

As proposed by @esoha-nvidia, to better serve users in this scenario, cuco could provide two handy utilities (a rough sketch follows the list):

  1. mapping_table_hash wrapper constructed from a span (a pointer to the beginning of the original data and a size) and the original hasher
  2. mapping_table_key_eq wrapper constructed from a span and the original key comparator
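A minimal sketch of what these two utilities could look like, assuming the hash table stores indices into the original data; the member names, __device__ call operators, and signatures below are illustrative only, not a settled API:

#include <cstddef>

template <typename T, typename Hash>
struct mapping_table_hash {
  T const* data;       // start of the original keys (the "span")
  std::size_t size;    // number of original keys
  Hash hash;           // hasher for the original key type

  // The table stores indices; hashing an index hashes the key it points to.
  __device__ auto operator()(std::size_t index) const { return hash(data[index]); }
};

template <typename T, typename KeyEqual>
struct mapping_table_key_eq {
  T const* data;
  std::size_t size;
  KeyEqual eq;

  // Comparing two indices compares the keys they point to.
  __device__ bool operator()(std::size_t lhs, std::size_t rhs) const
  {
    return eq(data[lhs], data[rhs]);
  }
};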

TODO:

Describe alternatives you've considered

Seeking help for better naming

esoha-nvidia commented 6 months ago

The complication with this adaptor is that the key_equal function depends on which table the input comes from.

For example, say we are doing a bulk_insert followed by a bulk_lookup. Imagine, as in a database join, that the insert uses primary keys from one table and the lookup uses foreign keys from another table.

When we do the insert, the key_equal function must perform primary_keys[index_to_insert] == primary_keys[slots_index]. But when we do the lookup, it must perform lookup_keys[index_to_lookup] == primary_keys[slots_index]. Above we have:

mapping_table_key_eq wrapper constructed from a pointer to the beginning of the original data, a size and the original key comparator

So which will be the "pointer to the beginning of the original data"? &primary_keys[0] or &lookup_keys[0]? It can't be both.

One solution would be, for example, to store both inside the equality functor and then make the index a tuple that indicates which table we should use for lookup. This is really wasteful of memory and also quite slow. The problem here is that when we are doing an insert versus a lookup, we know which table should be used for comparison, but that knowledge is thrown away because the equality operator is unable to use it.
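For concreteness, a sketch of that workaround; the name tagged_key_eq and the tuple layout are hypothetical, only to make the drawback visible. The tag enlarges every stored key and adds a branch to every comparison, even though the caller already knew which table applied:

#include <cstddef>
#include <cuda/std/tuple>

template <typename T, typename KeyEqual>
struct tagged_key_eq {
  T const* primary_keys;  // table used at insert time
  T const* lookup_keys;   // table used at lookup time
  KeyEqual eq;

  using tagged_index = cuda::std::tuple<int, std::size_t>;  // (which table, index)

  __device__ bool operator()(tagged_index lhs, tagged_index rhs) const
  {
    // Select the table per key, then compare the referenced keys.
    auto const& l = (cuda::std::get<0>(lhs) == 0 ? primary_keys : lookup_keys)[cuda::std::get<1>(lhs)];
    auto const& r = (cuda::std::get<0>(rhs) == 0 ? primary_keys : lookup_keys)[cuda::std::get<1>(rhs)];
    return eq(l, r);
  }
};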

I don't have a great solution for this, which is a major reason why I was unable to use cuCo. One possibility is to make an adapter around an existing hash table that lets you swap the equality operator. For example:

my_hash_table.bulk_insert(keys_to_insert); // Insert using default equality operator
mapping_table_key_eq lookup_table_eq(lookup_table); // Create a new equality operator.
my_hash_table.with_eq(lookup_table_eq).bulk_lookup(lookup_table); // Look up through a view that uses the new equality operator.

The with_eq function creates a "view" of the original slots but with a new equality operator. The implementation is basically just a reinterpret_cast that changes a template parameter, so it has no runtime cost. I can't think of a more elegant solution that still adheres to the STL hashing paradigm.
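A minimal sketch of what such a view adapter could look like; hash_table_view and its members are hypothetical, not cuco's actual types. Instead of a literal reinterpret_cast, the sketch constructs a new non-owning view over the same slots, which is equally free at runtime:

#include <cstddef>

template <typename Key, typename KeyEqual>
class hash_table_view {
 public:
  __host__ __device__ hash_table_view(Key* slots, std::size_t capacity, KeyEqual eq)
    : slots_{slots}, capacity_{capacity}, eq_{eq} {}

  // Rebind the view to a different equality operator. The slot storage is
  // untouched and only the functor is copied, so this is essentially free.
  template <typename NewKeyEqual>
  __host__ __device__ hash_table_view<Key, NewKeyEqual> with_eq(NewKeyEqual new_eq) const
  {
    return {slots_, capacity_, new_eq};
  }

 private:
  Key* slots_;
  std::size_t capacity_;
  KeyEqual eq_;
};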

esoha-nvidia commented 5 months ago

Here are some use cases

https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/join/join_common_utils.cuh#L47-L66
https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/include/cudf/detail/distinct_hash_join.cuh#L65-L77
https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/search/contains_table.cu#L44-L62
https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/text/bpe/byte_pair_encoding.cuh#L48-L73

njones93531 commented 5 months ago

As mentioned in Slack, I would like to work on this issue.