facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

Size of embedding tables in MLPerf checkpoint #369

Open AlCatt91 opened 9 months ago

AlCatt91 commented 9 months ago

Hello, I am looking at the pre-trained weights for the MLPerf benchmark configuration on Criteo Terabyte that are linked in the README (link). If I understand correctly, this should be the best checkpoint produced by the configuration run with the script ./bench/run_and_time.sh. Based on the code snippet

if args.max_ind_range > 0:
    ln_emb = np.array(
        list(
            map(
                lambda x: x if x < args.max_ind_range else args.max_ind_range,
                ln_emb,
            )
        )
    )

and since that config uses --max-ind-range=40000000, I was expecting the largest embedding tables (namely, tables 0, 9, 19, 20, 21) to be capped at exactly 40M rows. However, the row counts of these tensors in the state_dict of the downloaded checkpoint are more varied than that (listed below as read from the checkpoint; see the sketch after the list):

**emb_l.0.weight: torch.Size([39884406, 128])**
emb_l.1.weight: torch.Size([39043, 128])
emb_l.2.weight: torch.Size([17289, 128])
emb_l.3.weight: torch.Size([7420, 128])
emb_l.4.weight: torch.Size([20263, 128])
emb_l.5.weight: torch.Size([3, 128])
emb_l.6.weight: torch.Size([7120, 128])
emb_l.7.weight: torch.Size([1543, 128])
emb_l.8.weight: torch.Size([63, 128])
**emb_l.9.weight: torch.Size([38532951, 128])**
emb_l.10.weight: torch.Size([2953546, 128])
emb_l.11.weight: torch.Size([403346, 128])
emb_l.12.weight: torch.Size([10, 128])
emb_l.13.weight: torch.Size([2208, 128])
emb_l.14.weight: torch.Size([11938, 128])
emb_l.15.weight: torch.Size([155, 128])
emb_l.16.weight: torch.Size([4, 128])
emb_l.17.weight: torch.Size([976, 128])
emb_l.18.weight: torch.Size([14, 128])
**emb_l.19.weight: torch.Size([39979771, 128])**
**emb_l.20.weight: torch.Size([25641295, 128])**
**emb_l.21.weight: torch.Size([39664984, 128])**
emb_l.22.weight: torch.Size([585935, 128])
emb_l.23.weight: torch.Size([12972, 128])
emb_l.24.weight: torch.Size([108, 128])
emb_l.25.weight: torch.Size([36, 128])
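
(For reference, I read these sizes straight from the checkpoint, roughly as below; the file name is just a placeholder for wherever the downloaded checkpoint is stored.)

```python
import torch

# Placeholder path: wherever the downloaded MLPerf checkpoint lives on disk
ckpt = torch.load("dlrm_terabyte_mlperf.pt", map_location="cpu")

for name, tensor in ckpt["state_dict"].items():
    if name.startswith("emb_l."):
        print(f"{name}: {tuple(tensor.shape)}")
```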

How does the hashing work for this model? It cannot simply be the categorical value ID taken modulo 40M, as in the released PyTorch code. Moreover, some of the smaller embedding tables also appear to have been reduced in size, which suggests additional custom filtering or merging of the categorical values?
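
To spell out what I mean by "modulo 40M": my reading of the released preprocessing is plain modulo hashing of the raw categorical IDs, roughly as sketched below (a simplification, not the exact code path). Under that scheme, every feature with more than 40M raw values should end up with an embedding table of exactly 40,000,000 rows, which is not what the sizes above show.

```python
import numpy as np

max_ind_range = 40_000_000

def remap_ids(x_cat: np.ndarray) -> np.ndarray:
    # Fold every raw categorical ID into [0, max_ind_range).
    # A feature with more than 40M distinct raw values would then
    # always require an embedding table of exactly 40M rows.
    return x_cat % max_ind_range
```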

Also, I do not see a test_auc key in the checkpoint dictionary, despite --mlperf-logging being set in ./bench/run_and_time.sh: what is the test AUC of this pre-trained model?
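
(In case it is useful, this is roughly how I looked for it; the path is again a placeholder, and I am only guessing which metadata keys should be present.)

```python
import torch

ckpt = torch.load("dlrm_terabyte_mlperf.pt", map_location="cpu")  # placeholder path

# Top-level entries other than the weights: training metadata, losses, metrics, ...
meta_keys = sorted(k for k in ckpt if k != "state_dict")
print(meta_keys)
print(ckpt.get("test_auc", "no test_auc key in this checkpoint"))
```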