facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

How to access Embedding Tables? #157

Closed Jeongmiu closed 3 years ago

Jeongmiu commented 3 years ago

I am studying DLRM and embedding-table features in a PyTorch environment with the Kaggle dataset. There are some questions I can't solve by myself, so I'm asking here.

First, in the create_emb and apply_emb methods, I found that EmbeddingBag(n, m, mode="sum") creates an [n x m] matrix, and that EE / emb_l store the EmbeddingBag modules in create_emb. After the tables are made, apply_emb uses them (emb_l, ...) with this code: E = emb_l[k], V = E(index, offset, per_sample_weights)

I found that offset is a list made of the batch positions (0 ~ batch-1). What does index mean? Is it just a random value?

Also, what do n and m mean? PyTorch explains that n is the number of embeddings and m is the embedding dimension. But I heard that the embedding table is composed of user IDs and items, and that it is factorized separately. Is n the user ID and m the item? I'm confused.

Second, I want to match the concepts between the paper and the code, so I extracted the data with print(). I can't find the sparse indices in dlrm_s_pytorch.py. Are the sparse indices used to access the embedding tables made by dlrm_data_pytorch.py?

I'm not good at English, so if any sentence is unclear, please ask me and I'll edit it using a translator. Thanks.

benghaem commented 3 years ago

What does index mean? Is this just random value?

Each index represents a categorical value. Before training, all of the unique categorical values are given an index. You can find this preprocessing step in the code here: https://github.com/facebookresearch/dlrm/blob/1302c71624fa9dbe7f0c75fea719d5e58d33e059/data_utils.py#L876
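A toy sketch of that idea (just the general principle, not the actual data_utils.py code):

```python
# Toy sketch: give every unique categorical value an integer index.
raw_values = ["Red", "Green", "Blue", "Green", "Red"]

value_to_index = {}
for v in raw_values:
    if v not in value_to_index:
        value_to_index[v] = len(value_to_index)

# The raw column is replaced by integer indices into the embedding table.
indices = [value_to_index[v] for v in raw_values]
print(value_to_index)  # {'Red': 0, 'Green': 1, 'Blue': 2}
print(indices)         # [0, 1, 2, 1, 0]
```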


What is the n,m meaning?

N is the number of rows in the table. Each categorical value is assigned to a row. For example we could have the categorical values:

{Red, Green, Blue}

Which would give us N = 3. (Often in practice we make N smaller than the number of unique categorical values; we can use the hashing trick to decrease the table size.)

M is the dimension of an individual embedding. The embedding table learns M-dimensional vectors for each categorical value. Increasing M can help increase model quality.

In this model user ids and item ids would each get their own table with their own embeddings. To make a prediction for a <user, item> pair we look up the embedding for the user and the embedding for the item.
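A rough PyTorch sketch of that lookup (the table sizes here are made up for illustration, not DLRM's actual configuration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 1000 users, 500 items, embedding dimension M = 16.
user_table = nn.EmbeddingBag(1000, 16, mode="sum")  # N = 1000 rows
item_table = nn.EmbeddingBag(500, 16, mode="sum")   # N = 500 rows

# Look up the embeddings for the pair <user 42, item 7>.
user_idx = torch.tensor([42])
item_idx = torch.tensor([7])
offsets = torch.tensor([0])  # a single bag starting at position 0

user_vec = user_table(user_idx, offsets)  # shape [1, 16]
item_vec = item_table(item_idx, offsets)  # shape [1, 16]
```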


Are the sparse indices used to access the embedding tables made by dlrm_data_pytorch.py?

Yep. They are read from whatever dataset you're using. For Criteo-Kaggle they are produced here: https://github.com/facebookresearch/dlrm/blob/1302c71624fa9dbe7f0c75fea719d5e58d33e059/dlrm_data_pytorch.py#L284
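Since you also asked about offset: for Criteo each sample has exactly one value per categorical feature, so each bag contains one index and the offsets are simply 0, 1, ..., batch-1, which matches what you printed. A small made-up sketch:

```python
import torch
import torch.nn as nn

# One embedding table with made-up sizes: 10 rows, 4-dimensional embeddings.
table = nn.EmbeddingBag(10, 4, mode="sum")

# A batch of 4 samples, one categorical index per sample for this table.
indices = torch.tensor([7, 2, 2, 5])
offsets = torch.tensor([0, 1, 2, 3])  # = range(batch_size) when each bag holds one index

vectors = table(indices, offsets)  # shape [4, 4]: one embedding per sample
```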

Jeongmiu commented 3 years ago

Thank you for your quick reply. It's helpful to me. :) I have more questions.

Please check whether the flow I've understood is correct:

1. Data is preprocessed by data_utils.py and dlrm_data_pytorch.py
(this produces the input data, e.g. categorical features, continuous features, and sparse indices from the source train.txt)

2. Dense features are processed by the bottom MLP (create_mlp, apply_mlp), and the embedding tables are made by dlrm_s_pytorch.py using the sparse indices and categorical features (create_emb)

3. Each embedding table is accessed with user_id_index / item_id_index (apply_emb)

4. The results are concatenated in the interaction layer and the CTR is obtained using the top MLP (top_mlp)

If this is correct, step 3 is difficult for me.

The code says:

E = emb_l[k] = EmbeddingBag(n, m, ...) (k: 0 ~ 25), V = E(sparse_index, sparse_offsets)

There are 26 sparse indices (they consist of various integer values) and 13 continuous features. Is half of the data about users and the other half about items? Are they preprocessed in step 1, used in step 3, and do they need more operations to get the complete embedding vector?

If E = emb_l[1] is EmbeddingBag(1460, 16) (the EmbeddingBag holds 1460 values, e.g. {red, green, purple, ...}), is V = E(ind, off) retrieving data from EmbeddingBag(1460, 16), e.g. getting the embedding vector for purple? Then is ind the index of the user or item that I want to access?

benghaem commented 3 years ago

Is half of the data about users and the other half about items?

The split does not need to be 50/50. This model supports many categorical features. The Criteo dataset is unlabelled, but if you take a look at the Avazu dataset you can get an idea of what those categorical features may be.

For example, we could have an embedding table for each of the following categorical variables: User ID, Favorite color, Location, Item type, ... You can imagine many more categorical features that would describe either the user or the context related to the prediction.


Then is ind the index of the user or item that I want to access?

The index is for the given categorical feature. In your example this would be "color".
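Roughly, for your EmbeddingBag(1460, 16) example (the row number for "purple" is made up; the real mapping comes from the preprocessing step):

```python
import torch
import torch.nn as nn

# The "color" table from your example: 1460 possible values, 16-dim embeddings.
color_table = nn.EmbeddingBag(1460, 16, mode="sum")

# Suppose preprocessing mapped "purple" to row 3 of this table.
purple_index = torch.tensor([3])
offsets = torch.tensor([0])  # a batch containing a single lookup

purple_vec = color_table(purple_index, offsets)  # the learned 16-dim vector for "purple"
print(purple_vec.shape)  # torch.Size([1, 16])
```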

Chapter 8 of this book may be helpful.

Jeongmiu commented 3 years ago

Thank you for your detailed reply!

I think I need to learn more.

If I have any other questions, I will open this issue again. :)