Open tuanavu opened 11 months ago
Regarding 2:
Using the `parallel_hash_map` as your `volatile_db` is the suggested approach if you cannot fit the entire embedding table directly into GPU memory.
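As a rough sketch, a `volatile_db` block along these lines selects that backend. This is a hedged illustration, not a verified configuration: only `"type": "parallel_hash_map"` comes from the reply above; the surrounding structure should be checked against the HPS configuration reference.

```python
import json

# Hypothetical fragment of an HPS parameter-server configuration.
# Only "type": "parallel_hash_map" is taken from the discussion above;
# everything else is an illustrative placeholder.
ps_config = {
    "volatile_db": {
        "type": "parallel_hash_map",  # keep embeddings in host memory
    }
}

# Emit the fragment as it would appear in a JSON config file.
print(json.dumps(ps_config, indent=2))
```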
Regarding 3:
For performance reasons (to avoid frequent small allocations) and to limit long-term memory fragmentation, the hash_map backends allocate memory in chunks. The size of these chunks is 256 MiB. Since you have 42 tables, that means at least 42 × 256 MiB = 10752 MiB will be allocated. Given that your EC2 instance only has 16 GiB of memory, the OOM (Out-Of-Memory) error you are seeing is not too surprising. However, I noticed that your tables are rather small. I think it should be fine, without loss of performance, to decrease the allocation rate to 128 MiB, 100 MiB, or even lower, such as 64 MiB.
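The arithmetic above can be sketched as follows. `min_pool_mib` is a hypothetical helper for illustration, not an HPS API; it just captures the "at least one chunk per table" lower bound described in the reply.

```python
def min_pool_mib(num_tables: int, chunk_mib: int) -> int:
    """Lower bound on host memory claimed by the hash_map backends:
    each table is allocated at least one chunk of chunk_mib MiB."""
    return num_tables * chunk_mib

# With the default 256 MiB chunks and 42 tables:
print(min_pool_mib(42, 256))  # -> 10752 (MiB)

# Lowering the allocation rate shrinks that floor proportionally:
print(min_pool_mib(42, 64))   # -> 2688 (MiB)
```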
@tuanavu Regarding the 2nd question, I have some comments here. We have supported fp8 quantization in the static embedding cache since v23.08. HPS performs fp8 quantization on the embedding vectors when reading the embedding table if you enable `"fp8_quant": true` and `"embedding_cache_type": "static"` in the HPS JSON configuration file, and performs fp32 dequantization on the embedding vector corresponding to each queried embedding key in the static embedding cache, so as to preserve the accuracy of the dense-part prediction.
Because the embeddings are stored in fp8, the required GPU memory is greatly reduced. However, since business use cases differ, the precision loss caused by quantization/dequantization still needs to be evaluated in real production. So currently we only have experimental support in the static embedding cache, for POC verification. If quantization brings greater benefits to your case, we will add quantization features to the dynamic embedding cache and the upcoming lock-free optimized GPU cache.
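Putting the two settings named above together, here is a hedged sketch of the relevant configuration items. Only `fp8_quant` and `embedding_cache_type` come from the reply; any surrounding structure in a real hps_config.json is an assumption to verify against the HPS documentation.

```python
import json

# Illustrative fragment only: "fp8_quant" and "embedding_cache_type"
# are the settings described above; placement within the full config
# file is an assumption.
cache_settings = {
    "embedding_cache_type": "static",  # fp8 is only supported by the static cache
    "fp8_quant": True,                 # quantize embedding vectors to fp8 on load
}

# Emit the fragment as JSON (True serializes as the JSON literal `true`).
print(json.dumps(cache_settings, indent=2))
```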
Details
My company currently operates a Recommender model trained with TensorFlow 2 (TF2) and served on CPU pods. We are exploring the potential of HugeCTR due to its promising GPU embedding cache capabilities and are considering switching our model to it. We have successfully retrained our existing TF2 model with the SparseOperationKit (more info) and created the inference graph with HPS, as demonstrated in these notebooks: sok_to_hps_dlrm_demo.ipynb and demo_for_tf_trained_model.ipynb.
Result: We deployed the model and used Triton's perf_analyzer to test its performance with varying batch sizes. The results were as follows:
Testing Environment:
To maximize throughput, we plan to test the model across different instance types with varying GPU memory sizes. However, optimizing different parameters in config and selecting the best instance type for inference requires a clear understanding of how embedding cache size is calculated.
Details about the current model and embedding tables:
Our current model has various dense, sparse, and pre-trained sparse features. After exporting the TF+SOK model to HPS, we have 42 embedding tables in total, i.e. the `sparse_files` entries in hps_config.json. Here are the stats:

(table of per-table stats not shown)

`hps.Init` output

`hps_config.json` used for inference

Questions
Regarding the `allocation_rate` configuration in the above `volatile_db`: I observed that I must reduce `allocation_rate` to `1e6`, or else the default allocation (256 MiB) leads to an out-of-memory issue during `hps.Init`. Could you explain why this happens and provide some insights into this matter?