Thanks for your question. HPS initializes an embedding cache on each device listed in the user-configured deployed_devices (each device has its own independent embedding cache, and caches are not shared between devices). That is to say, the embedding cache on each device holds the complete set of embedding vectors, so there is no need to query multiple caches to assemble a single embedding vector.
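To make this concrete, here is a minimal sketch of what such a per-device deployment might look like in the HPS JSON configuration. The model name, file paths, and parameter values are hypothetical placeholders, and the exact set of required fields may vary across HugeCTR versions:

```python
import json

# Hypothetical HPS configuration: deploying one model on GPUs 0 and 1.
# Each device in "deployed_device_list" gets its own full, independent
# embedding cache; the table is not sharded across devices.
hps_config = {
    "supportlonglong": True,
    "models": [
        {
            "model": "demo_model",                  # hypothetical model name
            "sparse_files": ["demo_sparse.model"],  # hypothetical table file
            "embedding_table_names": ["sparse_embedding1"],
            "embedding_vecsize_per_table": [16],
            "maxnum_catfeature_query_per_table_per_sample": [26],
            "deployed_device_list": [0, 1],         # one cache per listed GPU
            "max_batch_size": 1024,
            "gpucache": True,
            "gpucacheper": 1.0,                     # cache 100% of the table
        }
    ],
}

with open("hps_demo.json", "w") as f:
    json.dump(hps_config, f, indent=2)
```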
Ok, thanks for your reply. If the GPU memory is not enough to hold the whole embedding table, should I set the GPU cache percentage below 100%? Is that right?
Currently HPS supports three embedding cache types (static, dynamic, and UVM). If your embedding table can be fully loaded into the GPU, it is recommended to choose static to get the best performance. If you need to update the embedding cache dynamically during online inference, or the embedding table cannot be fully loaded into the GPU, you can choose dynamic. For details, please refer to the following links: HPS configuration book and HPS Arch.
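To make these two knobs concrete, here is a sketch of the relevant fields inside a model entry of the HPS JSON configuration. The field names follow the HPS configuration book, while all values here are examples only:

```python
# Sketch: cache settings for a table larger than GPU memory. With the
# "dynamic" cache type, only the configured fraction of the table is kept
# in GPU memory, and missing keys fall back to lower storage tiers.
# These fields go inside a model entry of the HPS JSON configuration.
cache_settings = {
    "embedding_cache_type": "dynamic",  # one of "static", "dynamic", "uvm"
    "gpucache": True,
    "gpucacheper": 0.5,                 # cache ~50% of embeddings on the GPU
    "hit_rate_threshold": 0.9,          # example value
    "cache_refresh_percentage_per_iteration": 0.2,  # example value
}
```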
OK, thanks very much for your reply!
I found that the lookup implementation for HPS only supports specifying a single GPU device id. So I am confused: is there any way for HPS to load an embedding table onto multiple GPUs? If there is, how can I look up an embedding vector on multiple GPUs?
The HPS code location: https://github.com/NVIDIA-Merlin/HugeCTR/blob/91c5c9f16060ffd7ac99867e283f157e85e8a05d/HugeCTR/include/pybind/hps_wrapper.hpp#L41
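For what it's worth, given the earlier answer that each deployed device holds a complete, independent cache, a minimal sketch of multi-GPU usage might look like the following. This assumes the Python module exposes HPS(config_path) and a lookup(keys, model_name, table_id, device_id) method as in the wrapper linked above; the import path, signature, and names are assumptions and may differ across HugeCTR versions:

```python
import numpy as np
from hugectr.inference import HPS  # import path may vary by HugeCTR version

hps = HPS("hps_demo.json")  # config with "deployed_device_list": [0, 1]

keys = np.arange(8, dtype=np.uint64)  # hypothetical embedding keys

# Each deployed device caches the full table, so a single lookup on any
# one device returns complete results (table_id=0, device_id=0 here).
vectors = hps.lookup(keys, "demo_model", 0, 0)

# To spread work across GPUs, split the batch and issue one lookup per
# device; this is batch-level parallelism, not a sharded table.
halves = np.array_split(keys, 2)
results = [hps.lookup(h, "demo_model", 0, dev) for dev, h in enumerate(halves)]
```

In other words, under this design there is no cross-device lookup of a single vector; multiple GPUs are used by routing different requests (or slices of a batch) to different devices, each of which answers from its own complete cache.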