NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[Question] Difference between Embedding Training Cache and GPU Embedding Cache #424

Open hsezhiyan opened 8 months ago

hsezhiyan commented 8 months ago

What is the difference between the Embedding Training Cache (https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/HugeCTR/src/embedding_training_cache) and the GPU Embedding Cache (https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/gpu_cache)?

It appears as if the Embedding Training Cache is used only during training. Does it use the GPU Embedding Cache under the hood?

minseokl commented 8 months ago

Hi @hsezhiyan

Thanks, Minseok

hsezhiyan commented 8 months ago

Thank you for the response @minseokl

In that case, will ETC (which is being deprecated) be replaced by the GPU Embedding Cache for training? It looks like the GPU Embedding Cache can be used for both inference and training.

yingcanw commented 8 months ago

@hsezhiyan The ETC will be replaced by HierarchicalKV for training with hierarchical memory. We currently have no plans to integrate the GPU embedding cache into training. In addition, we have completed the implementation of a new-generation GPU embedding cache with higher performance and will release it soon.

sezhiyanhari commented 8 months ago

Thank you for the answer @yingcanw! I'd like to ask a few followup questions:

  1. Are there any instructions on how to use HierarchicalKV during training? I can only find HugeCTR training examples using ETC.
  2. Is there an expected timeframe when the updated GPU embedding cache will be released?
  3. From a design perspective, why are there different caching systems (ETC, GPU Embedding Cache) for training and inference? Was there a reason not to use a single caching system for both?
sezhiyanhari commented 7 months ago

@minseokl if you also have any insights, I would appreciate it!

yingcanw commented 7 months ago

@sezhiyanhari Sorry for the late reply.

  1. Here is the relevant API description about HKV. In addition, we have integrated HKV into SOK, so training can run seamlessly on TensorFlow. @kanghui0204 can provide a more detailed introduction if you have any questions about SOK.
  2. It is expected soon. If you currently only need the highest-performance GPU embedding cache lookup, you can also use this version of the cache.
  3. Because training and inference focus on different metrics in industrial use cases. For example, inference has very strict requirements on prediction latency. At the same time, the model needs to be updated in real time at high frequency, which requires the cache to deliver high-performance concurrent reads and writes. Synchronous training, by contrast, can separate cache reads from writes, and the pipeline can be optimized through operations such as prefetching. Therefore, different cache systems need to be designed to meet the performance requirements of training and inference.
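
To make the point about synchronous training concrete, here is a minimal pure-Python sketch of a training loop whose cache reads and writes happen in separate phases, with the next batch's keys prefetched ahead of time. It is only an illustration: `ToyEmbeddingCache`, `train`, the LRU policy, and the constant "gradient" are hypothetical stand-ins, not HugeCTR, ETC, or HKV APIs.

```python
from collections import OrderedDict


class ToyEmbeddingCache:
    """Hypothetical LRU cache in front of a slower backing store (a dict)."""

    def __init__(self, backing, capacity):
        self.backing = backing          # stands in for host memory / parameter server
        self.capacity = capacity        # max number of cached keys
        self.cache = OrderedDict()      # stands in for GPU memory
        self.misses = 0                 # lookups that had to hit the backing store

    def _insert(self, key, value):
        self.cache[key] = value
        self.cache.move_to_end(key)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used

    def prefetch(self, keys):
        # Bring the upcoming batch's keys into the cache before they are needed.
        for key in keys:
            if key in self.cache:
                self.cache.move_to_end(key)
            else:
                self._insert(key, self.backing[key])

    def lookup(self, keys):
        # Read phase: no writes happen while the batch is being read.
        values = []
        for key in keys:
            if key not in self.cache:
                self.misses += 1
                self._insert(key, self.backing[key])
            self.cache.move_to_end(key)
            values.append(self.cache[key])
        return values

    def update(self, keys, values):
        # Write phase: runs after the read phase of the same step.
        for key, value in zip(keys, values):
            self.backing[key] = value
            if key in self.cache:
                self.cache[key] = value


def train(batches, backing, capacity, lr=0.5):
    cache = ToyEmbeddingCache(backing, capacity)
    cache.prefetch(batches[0])
    for step, keys in enumerate(batches):
        values = cache.lookup(keys)                  # read
        new_values = [v - lr * 1.0 for v in values]  # stand-in for a gradient step
        cache.update(keys, new_values)               # write
        if step + 1 < len(batches):
            # A real pipeline would overlap this prefetch with compute;
            # here it simply runs sequentially after the write phase.
            cache.prefetch(batches[step + 1])
    return cache


if __name__ == "__main__":
    store = {k: 1.0 for k in range(6)}
    result = train([[0, 1], [2, 3], [0, 4]], store, capacity=4)
    print("misses:", result.misses, "store:", store)
```

Because each step finishes its reads before any write and the next batch is known in advance, every lookup in this toy loop is served from the cache as long as the capacity covers one batch. An inference cache cannot rely on that ordering, since lookups and model updates arrive concurrently.
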
lausannel commented 6 months ago

Hi, could you provide an example script about training using HKV and SOK?

I am a little confused about how HKV could replace ETC because as far as I know, HKV is a single GPU key-value store. Could it eliminate the Parameter Server in ETC?

Any insights are appreciated.

kanghui0204 commented 6 months ago

Hi @lausannel, here is an example of using SOK+HKV: SOK+HKV example

HKV is a key-value store that uses GPU + CPU memory, where the memory for values can be stored either on the GPU or on the CPU.

HKV repo
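
As a rough mental model of a key-value store whose values can live in either GPU or CPU memory, the following Python sketch keeps values in a small "fast" tier until it is full and places the rest in a "slow" tier. Everything here (`ToyTieredKV`, `fast_capacity`, the placement rule) is a hypothetical illustration and does not reflect HKV's actual API, placement, or eviction strategy; see the HKV repo for the real interface.

```python
class ToyTieredKV:
    """Hypothetical two-tier key-value store; both tiers are plain dicts."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = {}   # stands in for values held in GPU memory
        self.slow = {}   # stands in for values held in CPU memory

    def insert(self, key, value):
        # Prefer the fast tier while it has room (or already holds the key).
        if key in self.fast or len(self.fast) < self.fast_capacity:
            self.slow.pop(key, None)
            self.fast[key] = value
        else:
            self.slow[key] = value

    def find(self, key, default=None):
        # Check the fast tier first, then fall back to the slow tier.
        if key in self.fast:
            return self.fast[key]
        return self.slow.get(key, default)

    def tier_of(self, key):
        if key in self.fast:
            return "fast"
        if key in self.slow:
            return "slow"
        return None


if __name__ == "__main__":
    kv = ToyTieredKV(fast_capacity=2)
    for key in ("a", "b", "c"):
        kv.insert(key, [0.0, 0.0])
    print({key: kv.tier_of(key) for key in ("a", "b", "c")})
```
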

lausannel commented 6 months ago

@kanghui0204 Thanks for your explanation!