NVIDIA-Merlin / HugeCTR

HugeCTR is a high efficiency GPU framework designed for Click-Through-Rate (CTR) estimating training
Apache License 2.0
905 stars 196 forks source link

sok-experiment static_map empty_key_sentinel and reclaimed_key_sentinel is not right for int64 [BUG] #432

Closed amazingyyc closed 7 months ago

amazingyyc commented 7 months ago

Describe the bug sok-experiment static_map create 2 special key for HashTable:empty_key_sentinel/empty_key_sentinel It's -1 and -2 when key type is int64. In some case the real training data's value maybe -1 or -2. it will make core dump.

code:https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/sparse_operation_kit/experiment/variable/impl/dynamic_embedding_table/cuCollections/include/cuco/static_map.cuh#L96

static constexpr key_type empty_key_sentinel = ~static_cast<key_type>(0);
static constexpr key_type reclaimed_key_sentinel = (~static_cast<key_type>(0)) ^ 1;
kanghui0204 commented 7 months ago

Hi @amazingyyc , Thank you for finding a bug in DET. Actually, we haven't tested the scenario before where the key has a negative value. empty_key_sentinel and empty_key_sentinelare two reserved values in the hash table of DET, representing whether the slot is empty or the slot's key has been deleted. In DET, we must reserve two values for empty_key_sentineland empty_key_sentinel. The method I can think of is to expose an interface allowing users to set these two values.

However, the timing of the setting is a question (1. Determine during the compilation period, controlled by macro definitions; 2. Determine during the runtime, with significant changes). I need to discuss with my colleagues about how to modify this. If we come to any conclusions and changes, I will notify you.

amazingyyc commented 7 months ago

Hi @amazingyyc , Thank you for finding a bug in DET. Actually, we haven't tested the scenario before where the key has a negative value. empty_key_sentinel and empty_key_sentinelare two reserved values in the hash table of DET, representing whether the slot is empty or the slot's key has been deleted. In DET, we must reserve two values for empty_key_sentineland empty_key_sentinel. The method I can think of is to expose an interface allowing users to set these two values.

However, the timing of the setting is a question (1. Determine during the compilation period, controlled by macro definitions; 2. Determine during the runtime, with significant changes). I need to discuss with my colleagues about how to modify this. If we come to any conclusions and changes, I will notify you.

A choice maybe help:

template <typename KeyT>
struct SentinelKey {
  static KeyT EmptyKey() {
    if (std::numeric_limits<KeyT>::is_signed) {
      return std::numeric_limits<KeyT>::max();
    } else {
      return std::numeric_limits<KeyT>::max();
    }
  }

  static KeyT ReclaimedKey() {
    if (std::numeric_limits<KeyT>::is_signed) {
      return std::numeric_limits<KeyT>::min();
    } else {
      return std::numeric_limits<KeyT>::max() - 1;
    }
  }
};

template <.... typename ReservedKeyT=SentinelKey<KeyT>>
class static_map {
  static_map(): empty_key_(ReservedKeyT::EmptyKey()), reclaimed_key_(ReservedKeyT::ReclaimedKey())
}
kanghui0204 commented 7 months ago

Thank you very much for your code. I had considered using this method, which is easy to modify and elegant, and it's much safer than -1/-2. However, I haven't chosen this method temporarily because user’s key are still occupied by two values without their awareness and cannot change these values. So, I will discuss with my colleagues on Friday about how to solve this issue.

kanghui0204 commented 7 months ago

Hi @amazingyyc , I already fix this bug, and thank you very much to found the bugs , the fix update in next release!