efficient / libcuckoo

A high-performance, concurrent hash table
Other
1.62k stars 275 forks source link

Add cuckoohash_map::find_or_insert_fn #138

Closed milianw closed 2 years ago

milianw commented 3 years ago

This new API makes it trivial to leverage libcuckoo for string interning or similar areas, where constructing a value is potentially expensive or requires further book keeping:

    std::shared_mutex mutex;
    std::vector<std::string> indexToString;
    libcuckoo::cuckoohash_map<std::string, std::size_t> stringToIndex;

    auto stringForIndex = [&](std::size_t index)
    {
        auto lock = std::shared_lock(mutex);
        return indexToString[index];
    };

    auto indexForString = [&](const std::string &string)
    {
        return stringToIndex.find_or_insert_fn(string, [&]() {
            auto lock = std::unique_lock(mutex);
            auto index = indexToString.size();
            indexToString.push_back(string);
            return index;
        });
    };

Note how we only need to acquire the expensive unique_lock when we are inserting a new string key value. In the hopefully common case where the string was already inserted previously, we can rely on the shared libcuckoo locking to reduce lock contention.

I do not see any way to leverage the existing libcuckoo API in a similar manner - but maybe I'm missing something? Note how I cannot use something like the following, as it would contain race conditions for the size call and subsequent insertion into the indexToString vector:

    // problem #1: multiple threads could read the same index value here
    auto index = stringToIndex.size();
    auto wasInserted = stringToIndex.upsert(string, [&index](const auto &existingIndex) {
        index = existingIndex;
    }, index);
    if (wasInserted) {
        // problem #2: order in which indexToString is updated is undefined
        auto lock = std::unique_lock(mutex);
        indexToString.push_back(string);
   }
manugoyal commented 3 years ago

Hi Milian! Yeah I think this is a good point. I had imagined that somebody who wanted to construct values using a custom function could create a custom type, so something like the following:

template <typename T>
struct ObjWrapper {
  template <typename F>
  ObjWrapper(F constructor) : obj(constructor()) {}
  T obj;
};

int main() {
  std::shared_mutex mutex;
  std::vector<std::string> indexToString;
  libcuckoo::cuckoohash_map<std::string, ObjWrapper<size_t>> stringToIndex;

  auto stringForIndex = [&](size_t index) {
    auto lock = std::shared_lock(mutex);
    return indexToString[index];
  };

  auto indexForString = [&](const std::string& string) {
    size_t ret_index;
    stringToIndex.upsert(
        string,
        [&ret_index](const ObjWrapper<size_t>& wrapper) {
          ret_index = wrapper.obj;
        },
        [&] {
          auto lock = std::unique_lock(mutex);
          ret_index = indexToString.size();
          indexToString.push_back(string);
          return ret_index;
        });
    return ret_index;
  };

  ...
}

The indexForString function is a bit more cumbersome than what you have written, but I think it should be the same efficiency?

milianw commented 3 years ago

Yes, you are right - using a different type would work too - but I have to ask: Why not simply adapt the API of cuckoohash_map to make this simple and explicit? I mean there are already some _fn types, this one is just missing?

milianw commented 3 years ago

Ping? Anything i can do to move this forward? Or are you adamant about moving this to the type side instead?

manugoyal commented 2 years ago

Sorry for the late reply @milianw. I am not opposed to modifying the API, but I do feel that adding an entirely new function like this may become confusing. Note that the functions passed to the _fn methods are for mutating the value in place, once it is already in the table (essentially letting the user do arbitrary operations on the mapped_type& under the lock), so repurposing that naming convention to mean "constructor" function may be confusing.

Furthermore, it would be nice to stick to an API that is analogous to the STL, which doesn't implement any special overrides for constructing-from-function. However, it is definitely easier to mutate a value after default construction in STL, with something like this:

auto [it, is_new] = map.try_emplace(key /* value will be default-constructed */);
if (is_new) {
  // Set it->second to whatever is desired.
}

So maybe we can emulate this functionality by augmenting the _fn in uprase_fn, so that you can pass in a functor which implements operator()(mapped_type&, bool /* is_new */). If you pass in such a functor, uprase_fn will invoke the functor both when the key was already in the map and if it was newly-inserted, and the user can use the is_new boolean to distinguish between cases. This differs from the current behavior, where the functor is only invoked if the key is already in the map.

Does this sound like a reasonable API?

milianw commented 2 years ago

Hey there.

Yes, your proposal could work, I'm not overly attached to my proposal. Just as a suggestion though, I would not use a plain bool argument but rather some kind of enum class to make this API more explicit and self explanatory what the bool state means.

Cheers

manugoyal commented 2 years ago

Hi @milianw, just created https://github.com/efficient/libcuckoo/pull/141, which adds the functionality we discussed. Let me know if this looks alright, and I'll merge it in!

milianw commented 2 years ago

superseded by https://github.com/efficient/libcuckoo/pull/141