facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

`% max_ind_range` in a strange place #347

Closed gejinchen closed 11 months ago

gejinchen commented 1 year ago

What is the purpose of doing `% max_ind_range` here? Here `X_cat` has only just been converted from raw hexadecimal values to decimal values. It is not an embedding table index yet. In fact, `X_cat` even still has negative values here. Besides, `% max_ind_range` is done again later here anyway.

mnaumovfb commented 1 year ago

The first location lets you build smaller dictionaries, so pre-processing the dataset needs less time and space. If you pre-process the dataset and then run with the same given `--max-ind-range`, that alone would have been enough.
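To illustrate the dictionary-size point, here is a minimal sketch. It assumes the dictionaries map each distinct raw value to a contiguous index; `build_dictionary` and the sample tokens are hypothetical and are not taken from the repository.

```python
def build_dictionary(raw_hex_values, max_ind_range=0):
    """Map each distinct (optionally folded) raw value to a contiguous index.

    Hypothetical helper for illustration only, not the repository's code.
    """
    dictionary = {}
    for token in raw_hex_values:
        value = int(token, 16)  # hexadecimal string -> decimal integer
        if max_ind_range > 0:
            value %= max_ind_range  # fold BEFORE the dictionary is built
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


# Made-up categorical tokens; one of them repeats.
raw = ["05db9164", "68fd1e64", "8cf07265", "05db9164", "be589b51", "1464facd"]

full = build_dictionary(raw)                      # one entry per distinct token
folded = build_dictionary(raw, max_ind_range=3)   # at most 3 entries

print(len(full), len(folded))
```

With the fold applied first, the dictionary can never hold more than `max_ind_range` entries, no matter how many distinct raw values the dataset contains.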

However, imagine that you would like to decouple dataset pre-processing from the experiments you run on it. You could pre-process the dataset once without the `--max-ind-range` option and then make multiple runs with `--max-ind-range` set to different values. That's why you need the ability to trim the index in the second location. The latter requires a somewhat more advanced setup and familiarity with the code, but it can save a lot of time in some scenarios.
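The decoupled workflow could look roughly like the sketch below. It assumes the pre-processed categorical features are already non-negative integer indices held in a NumPy array; `trim_indices` and the sample values are hypothetical stand-ins, not the repository's loader.

```python
import numpy as np


def trim_indices(X_cat, max_ind_range):
    """Fold pre-processed indices into [0, max_ind_range) at load time.

    Hypothetical helper for illustration; a non-positive range means no trimming.
    """
    if max_ind_range > 0:
        return X_cat % max_ind_range
    return X_cat


# Stand-in for features pre-processed ONCE, without any range limit.
X_cat = np.array([[0, 17, 123456], [5, 42, 999999]], dtype=np.int64)

# Several experiments can reuse the same pre-processed data with different limits.
for limit in (10, 1000, 100000):
    trimmed = trim_indices(X_cat, limit)
    print(limit, int(trimmed.max()))
```

The expensive pre-processing step happens once, and each run only pays for a cheap element-wise modulo when the data is loaded.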