Closed gejinchen closed 11 months ago
The first location allows you to have smaller dictionaries and therefore needs less time and space for pre-processing the dataset. If you pre-process the dataset and always run with one given --max-ind-range, that would be enough.
However, imagine that you would like to decouple dataset pre-processing from the experiments that use it. You could pre-process the dataset once without the --max-ind-range option and then make multiple runs with --max-ind-range set to different values. That's why you need the ability to trim the index in the second location. The latter requires a slightly more advanced setup and familiarity with the code, but it can save a lot of time in some scenarios.
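The decoupled workflow can be illustrated with a minimal sketch. It is not the repository's actual code: trim_indices and x_cat_full are hypothetical names, plain Python lists stand in for the real arrays, and treating a non-positive range as "no trimming" is an assumption made for this example.

```python
def trim_indices(rows, max_ind_range):
    """Fold every categorical value into [0, max_ind_range).

    A non-positive max_ind_range is treated as "no trimming"
    (an assumption made for this sketch).
    """
    if max_ind_range <= 0:
        return [list(row) for row in rows]
    return [[value % max_ind_range for value in row] for row in rows]


# Hypothetical stand-in for a dataset pre-processed once, without trimming.
# Negative entries mimic raw values that are not yet valid table indices.
x_cat_full = [
    [1234567, -42, 987654321],
    [55, 1000003, -7],
]

# Reuse the same pre-processed data with several index ranges.
runs = {m: trim_indices(x_cat_full, m) for m in (10, 1000)}
for max_ind_range, x_cat in runs.items():
    print(max_ind_range, x_cat)
```

In Python, % with a positive modulus always returns a non-negative result, so negative raw values also land in the valid range. Applying the same trim a second time leaves the values unchanged, so a repeated modulo with the same range is redundant but harmless.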
What is the purpose of doing % max_ind_range here? Here, X_cat was just converted from raw hexadecimal values to decimal values. It is not an embedding table index yet. In fact, X_cat even still has negative values here. Besides, % max_ind_range is done again later here anyway.