[Question] Difference between Embedding Training Cache and GPU Embedding Cache

hsezhiyan commented 1 year ago

What is the difference between the Embedding Training Cache (https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/HugeCTR/src/embedding_training_cache) and the GPU Embedding Cache (https://github.com/NVIDIA-Merlin/HugeCTR/tree/main/gpu_cache)?

It appears as if the Embedding Training Cache is used only during training. Does it use the GPU Embedding Cache under the hood?

minseokl commented 1 year ago

Hi @hsezhiyan

Yes, the Embedding Training Cache (ETC) is a training feature that enables the use of embedding tables larger than the GPU memory capacity. It is not built on the GPU Embedding Cache. Please also note that this feature is being deprecated.
The GPU Embedding Cache is mainly used by our inference use cases, through the Hierarchical Parameter Server (HPS). If you are interested in HPS, please check out https://nvidia-merlin.github.io/HugeCTR/main/hierarchical_parameter_server/index.html

Thanks, Minseok

hsezhiyan commented 1 year ago

Thank you for the response @minseokl

In that case, will ETC (which is being deprecated) be replaced by the GPU Embedding Cache for training? It looks like the GPU Embedding Cache can be used for both inference and training.

yingcanw commented 1 year ago

@hsezhiyan The ETC will be replaced by HierarchicalKV for training with hierarchical memory. We currently have no plans to integrate the GPU embedding cache into training. In addition, we have completed the implementation of a new-generation GPU embedding cache with higher performance and will release it soon.

sezhiyanhari commented 1 year ago

Thank you for the answer @yingcanw! I'd like to ask a few followup questions:

1. Are there any instructions on how to use HierarchicalKV during training? I can only find HugeCTR training examples using ETC.
2. Is there an expected timeframe for the release of the updated GPU embedding cache?
3. From a design perspective, why are there different caching systems (ETC, GPU Embedding Cache) for training and inference? Was there a reason not to build a single caching system for both?

sezhiyanhari commented 1 year ago

@minseokl if you also have any insights, I would appreciate it!

yingcanw commented 1 year ago

@sezhiyanhari Sorry for the late reply.

1. Here is the relevant API description about HKV. In addition, we have integrated HKV into SOK, which allows seamless training on the TensorFlow platform. @kanghui0204 will provide a more detailed introduction if you have any questions about SOK.
2. It is expected soon. If you currently only need the highest-performance GPU embedding cache lookup, you can also use this version of the cache.
3. Training and inference focus on different metrics in industrial use cases. For example, inference has very strict requirements on prediction latency. At the same time, the model needs to be updated in real time at high frequency, which requires the cache to support high-performance concurrent reads and writes. Synchronous training, however, can separate cache reads from writes, and the pipeline can be optimized through operations such as prefetching. Different cache systems are therefore needed to meet the performance requirements of training and inference.
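
To make point 3 a little more concrete, here is a rough Python sketch of the training-side pattern: reads and writes happen in separate phases, and the rows for the next batch are staged ahead of time. It is only an illustration under assumptions, not HugeCTR, ETC, or HKV code. The class `ToyTrainingCache`, the function `train`, and the constant placeholder gradient are all hypothetical, and the toy cache never evicts anything.

```python
class ToyTrainingCache:
    """Hypothetical stand-in for a training-side embedding cache (not HugeCTR code)."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # slow tier, e.g. host memory
        self.cache = {}                     # fast tier, e.g. GPU memory (unbounded here)
        self.misses = 0

    def prefetch(self, keys):
        # Stage rows for an upcoming batch before they are needed.
        for k in keys:
            if k not in self.cache:
                self.cache[k] = self.backing_store[k]

    def read(self, keys):
        # Read phase: count any row that was not staged in advance.
        out = []
        for k in keys:
            if k not in self.cache:
                self.misses += 1
                self.cache[k] = self.backing_store[k]
            out.append(self.cache[k])
        return out

    def write(self, keys, grads, lr=0.1):
        # Write phase: runs only after the read phase of the same step is done.
        for k, g in zip(keys, grads):
            self.cache[k] -= lr * g
            self.backing_store[k] = self.cache[k]


def train(batches, cache):
    """Run a synchronous loop and return the number of read misses."""
    cache.prefetch(batches[0])
    for i, keys in enumerate(batches):
        values = cache.read(keys)
        if i + 1 < len(batches):
            # In a real pipeline this would overlap with the compute below.
            cache.prefetch(batches[i + 1])
        grads = [1.0 for _ in values]  # placeholder gradient, one per looked-up row
        cache.write(keys, grads)
    return cache.misses
```

Because the batch order is known in advance during synchronous training, every read in this sketch hits the staged rows. An inference cache cannot rely on that: requests arrive unpredictably while the model is being updated, so reads and writes have to be served concurrently.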

lausannel commented 11 months ago

Hi, could you provide an example script about training using HKV and SOK?

I am a little confused about how HKV could replace ETC because, as far as I know, HKV is a single-GPU key-value store. Could it eliminate the Parameter Server in ETC?

Any insights are appreciated.

kanghui0204 commented 11 months ago

Hi @lausannel, here is an example of using SOK+HKV: SOK+HKV example

HKV is a key-value store that uses GPU + CPU memory, where the memory for values can be stored either on the GPU or on the CPU.
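
As a mental model only, the idea that a value can live in either GPU or CPU memory can be pictured with the toy Python sketch below. This is not the HierarchicalKV API and says nothing about how HKV is actually implemented. The class `ToyHybridStore`, its methods, and the "fill GPU slots first" placement rule are assumptions made up for illustration.

```python
class ToyHybridStore:
    """Toy key-value store whose values sit in one of two tiers (NOT the HKV API)."""

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots  # how many values fit in the fast tier
        self.index = {}             # key -> (tier name, slot)
        self.gpu_values = []        # stands in for GPU memory
        self.cpu_values = []        # stands in for CPU (host) memory

    def _tier(self, name):
        return self.gpu_values if name == "gpu" else self.cpu_values

    def insert(self, key, value):
        if key in self.index:
            # Existing key: overwrite the value where it already lives.
            tier, slot = self.index[key]
            self._tier(tier)[slot] = value
            return
        # New key: assumed placement rule, fill the GPU tier first, then spill to CPU.
        tier = "gpu" if len(self.gpu_values) < self.gpu_slots else "cpu"
        values = self._tier(tier)
        values.append(value)
        self.index[key] = (tier, len(values) - 1)

    def find(self, key):
        # Returns the stored value, or None if the key is unknown.
        location = self.index.get(key)
        if location is None:
            return None
        tier, slot = location
        return self._tier(tier)[slot]


store = ToyHybridStore(gpu_slots=2)
store.insert(10, [0.1])
store.insert(20, [0.2])
store.insert(30, [0.3])  # the GPU tier is full, so this value lands in the CPU tier
```

The caller sees a single lookup interface regardless of where a value is placed, which is what lets the table grow past the capacity of GPU memory.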

HKV repo

lausannel commented 11 months ago

@kanghui0204 Thanks for your explanation!

NVIDIA-Merlin / HugeCTR

[Question] Difference between Embedding Training Cache and GPU Embedding Cache #424