facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License
3.71k stars, 825 forks

[Question] how are the categorical features in test data converted to the embedding indices? #341

Closed (ksjsky9888 closed this 11 months ago)

ksjsky9888 commented 1 year ago

Hi,

I'm trying to understand the DLRM process with the Criteo Kaggle dataset. I understand the training process and found that each categorical feature value in the training data is converted to a unique index. (For example, "0x68fd1e64" is converted to the index "0x0" in lS_i.)

Here is my question: during inference, how are the categorical features converted to the indices that correspond to the embedding table? Since the embedding table and its indices are determined at the training stage, I would expect no information to be available at inference time for converting a feature value to an index. However, the indices (lS_i) and offsets (lS_o) are already determined when they are passed to the "apply_emb" function at inference. I don't understand how the indices of categorical features in the test data are determined before the embedding look-up.

For example, suppose there is an embedding table for a movie list and "spider man" is assigned index "3" at the training stage. When a new user's movie list (categorical features) arrives and includes "spider man", how does the inference model know that the index of "spider man" is "3" before the embedding look-up stage (apply_emb)? Thank you.

Best regards, SJ Kim

mnaumovfb commented 1 year ago

The dataset pre-processing always goes over all the data and splits it into (i) the first days, which are used for training, and (ii) the last day, which can be used for testing and validation. This is executed regardless of whether the code is run with or without the --inference-only flag.
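As a rough illustration of that split (a toy sketch, not the repository's actual code; `split_by_day` and the toy data are hypothetical, and using the second half of the last day for validation is an assumption):

```python
# Hypothetical sketch: names below are illustrative, not from the DLRM repo.

def split_by_day(days):
    """Split a list of per-day sample lists into train / test / validation.

    All days except the last form the training set. The last day is cut
    in half: first half for testing, second half (assumed) for validation.
    """
    *train_days, last_day = days
    train = [sample for day in train_days for sample in day]
    half = len(last_day) // 2
    test, val = last_day[:half], last_day[half:]
    return train, test, val


# Seven toy "days" with four samples each; a sample is (day, position).
days = [[(d, i) for i in range(4)] for d in range(7)]
train, test, val = split_by_day(days)
```

The split only decides which already-processed samples go where; it does not depend on whether the model is later run for training or for inference.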

Therefore, inference uses the testing data (the first half of the last day), which has been pre-processed in the same pass as the training data. The same raw categorical value is thus assigned the same index in both the training and testing splits.
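A minimal sketch of why the assignments stay consistent, assuming a simple "first seen, next free index" dictionary per categorical column (the helper names `build_index_maps` and `encode` and the toy rows are hypothetical, not taken from the DLRM code):

```python
# Hypothetical sketch: function and variable names are illustrative only.

def build_index_maps(rows):
    """Assign each distinct raw value in each categorical column a dense index.

    `rows` is a list of samples, each a list of raw categorical strings.
    Returns one dict per column mapping raw value -> embedding row index.
    """
    num_cols = len(rows[0])
    maps = [{} for _ in range(num_cols)]
    for row in rows:
        for col, value in enumerate(row):
            # The first time a value is seen it gets the next free index.
            maps[col].setdefault(value, len(maps[col]))
    return maps


def encode(rows, maps):
    """Replace raw categorical values with their indices."""
    return [[maps[col][value] for col, value in enumerate(row)] for row in rows]


# Toy data: the first three rows play the role of the training days,
# the last row plays the role of the test split.
all_rows = [
    ["68fd1e64", "spider man"],
    ["05db9164", "batman"],
    ["68fd1e64", "iron man"],
    ["05db9164", "spider man"],
]

maps = build_index_maps(all_rows)   # one pass over ALL data
encoded = encode(all_rows, maps)    # encode before splitting
train_idx, test_idx = encoded[:3], encoded[3:]
```

In this toy run, "spider man" in the test row gets the same index it had in the training rows, because the dictionary was built once over everything and the split happened afterwards. A value that appears only in the test split would still receive an index under this scheme, although its embedding row would not have been updated during training.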

Let me know if this helps.