NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[Question] Is there any related architecture design or documentation for embedding collection #444

Closed Jiaao-Bai closed 3 months ago

Jiaao-Bai commented 3 months ago

I would like to learn how embedding collection fuses lookups and poolings from different groups into a single kernel while supporting different vector sizes.

shijieliu commented 3 months ago

Hi @Jiaao-Bai, thanks for trying out HugeCTR.

About your question: there is currently no public doc on our design. The embedding collection is built from several components, and I would like to give some pointers to help you understand it.

  1. Embedding: This is where the forward and backward passes for the embedding are implemented. Each sharding strategy, such as data parallel or model parallel, has its own implementation. The source code is under embedding.
  2. Embedding Storage: This is how we store the embedding vectors. We provide static storage, where the number of embedding vectors in the table cannot change after initialization, and dynamic storage. You can find their implementations here.
  3. Data distributor: This converts the data-parallel embedding input (the keys) into a model-parallel format that the Embedding can consume. The code is here.
  4. Embedding Operator: These are the basic operators used in the Embedding. We split the forward pass into the stages model_forward, all2all, and network_forward, and provide a separate operator for each. The same mechanism applies to the backward pass, with the stages network_backward, all2all, and local_reduce. You can find the code here.
  5. Generic Lookup: This is a template kernel that we use to generate the different kernel variants used in the embedding. The code is here.
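
To illustrate the idea behind item 5, here is a toy Python sketch of how lookups with different vector sizes can share one loop. It is not HugeCTR's kernel: the function name `fused_lookup_pool`, the CSR-style `bucket_offsets` layout, and the flat output buffer are all assumptions made for illustration. Each (lookup, sample) pair is treated as one "bucket", and the bucket index alone determines the table, the vector size, and the output position, so a single flat loop (one CUDA block or warp per bucket in a real kernel) can serve every lookup.

```python
from itertools import accumulate


def fused_lookup_pool(tables, ev_sizes, combiners, keys, bucket_offsets, batch_size):
    """Toy fused lookup + pooling over several tables (hypothetical layout).

    tables[i]      : dict mapping key -> list of floats of length ev_sizes[i]
    combiners[i]   : "sum" or "mean" for lookup i
    keys           : flat key list for all buckets, ordered lookup-major
    bucket_offsets : CSR offsets, num_lookups * batch_size + 1 entries
    """
    num_lookups = len(tables)
    # Prefix sum over per-lookup output sizes: where each lookup starts
    # in the flat output buffer, even though vector sizes differ.
    ev_start = [0] + list(accumulate(s * batch_size for s in ev_sizes))
    out = [0.0] * ev_start[-1]

    # One flat loop over every bucket of every lookup.
    for bucket in range(num_lookups * batch_size):
        lookup, sample = divmod(bucket, batch_size)
        ev_size = ev_sizes[lookup]
        begin, end = bucket_offsets[bucket], bucket_offsets[bucket + 1]
        dst = ev_start[lookup] + sample * ev_size

        # Accumulate every embedding vector referenced by this bucket.
        for key in keys[begin:end]:
            vec = tables[lookup][key]
            for d in range(ev_size):
                out[dst + d] += vec[d]

        # Mean pooling divides by the number of keys in the bucket.
        if combiners[lookup] == "mean" and end > begin:
            for d in range(ev_size):
                out[dst + d] /= end - begin
    return out, ev_start


# Two lookups with vector sizes 2 and 3, batch size 2.
tables = [
    {0: [1.0, 2.0], 1: [3.0, 4.0]},
    {0: [1.0, 1.0, 1.0], 5: [3.0, 5.0, 7.0]},
]
keys = [0, 1, 1, 0, 5]            # buckets: [0,1] [1] [0,5] []
bucket_offsets = [0, 2, 3, 5, 5]
out, ev_start = fused_lookup_pool(
    tables, [2, 3], ["sum", "mean"], keys, bucket_offsets, batch_size=2
)
print(ev_start)
print(out)
```

The point of the prefix sum is that no branch on "which table am I in" is needed beyond an index computation, which is what makes a single templated kernel plausible for mixed vector sizes.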

For your question, I think you can refer to items 1, 2, 4, and 5. I would suggest starting with the data-parallel embedding, since it's easier to understand. Thanks!
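
As a rough illustration of why data parallel is the easier starting point, the toy Python sketch below simulates both strategies with plain lists standing in for GPUs. Only the stage names model_forward, all2all, and network_forward come from the list above; what each stage does here is one plausible reading, and the `key % num_gpus` sharding rule and all function names are assumptions, not HugeCTR's implementation.

```python
def pool(table, keys, ev_size):
    """Sum-pool the embedding vectors of `keys` into one vector."""
    out = [0.0] * ev_size
    for key in keys:
        for d, value in enumerate(table[key]):
            out[d] += value
    return out


def data_parallel_forward(table, batch, num_gpus, ev_size):
    """Every GPU holds the full table and pools only its own samples."""
    per_gpu = len(batch) // num_gpus
    return [
        [pool(table, keys, ev_size) for keys in batch[g * per_gpu:(g + 1) * per_gpu]]
        for g in range(num_gpus)
    ]


def model_parallel_forward(table, batch, num_gpus, ev_size):
    """Table rows are sharded across GPUs (assumed rule: key % num_gpus)."""
    per_gpu = len(batch) // num_gpus
    shards = [
        {k: v for k, v in table.items() if k % num_gpus == g}
        for g in range(num_gpus)
    ]

    # Stage 1, "model_forward" (assumed meaning): each GPU pools the keys
    # it owns for every sample in the global batch, giving partial sums.
    partial = [
        [pool(shards[g], [k for k in keys if k % num_gpus == g], ev_size)
         for keys in batch]
        for g in range(num_gpus)
    ]

    # Stage 2, "all2all" (assumed meaning): GPU `src` sends GPU `dst`
    # the partial sums that belong to dst's slice of the batch.
    received = [
        [partial[src][dst * per_gpu:(dst + 1) * per_gpu] for src in range(num_gpus)]
        for dst in range(num_gpus)
    ]

    # Stage 3, "network_forward" (assumed meaning): each GPU adds up the
    # partial sums it received from all sources.
    outputs = []
    for dst in range(num_gpus):
        rows = []
        for s in range(per_gpu):
            acc = [0.0] * ev_size
            for src in range(num_gpus):
                for d in range(ev_size):
                    acc[d] += received[dst][src][s][d]
            rows.append(acc)
        outputs.append(rows)
    return outputs


table = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0], 3: [7.0, 8.0]}
batch = [[0, 1], [2], [1, 2, 3], []]   # four samples, two per GPU
dp = data_parallel_forward(table, batch, num_gpus=2, ev_size=2)
mp = model_parallel_forward(table, batch, num_gpus=2, ev_size=2)
print(dp == mp)
```

The data-parallel path needs no communication at all, which is why it is the simpler one to read first; the model-parallel path produces the same result but needs the extra exchange and reduction steps.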

Jiaao-Bai commented 3 months ago

thanks