facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

Getting NaN loss when training dlrm with Kaggle Criteo dataset #363

Open ZhanqiuHu opened 10 months ago

ZhanqiuHu commented 10 months ago

Hello,

I'm running some training with the Kaggle Criteo dataset, and here is the command I ran:

torchx run -s local_cwd dist.ddp -j 1x1 --script dlrm_main.py --\
    --in_memory_binary_criteo_path $PREPROCESSED_DATASET \
    --pin_memory \
    --mmap_mode \
    --batch_size 128 \
    --test_batch_size 16384 \
    --learning_rate 0.001 \
    --dataset_name criteo_kaggle \
    --dense_arch_layer_sizes "13,512,256,64,16" \
    --over_arch_layer_sizes "512,256,1" \
    --epochs 10 \
    --embedding_dim 16 \
    --validation_freq_within_epoch 1024 \
    --shuffle_batches

The model hyperparameters I chose follow this example script. I'm getting NaN losses for some iterations. The preprocessed dataset does not contain NaN values, and I have tried 0.1, 0.01, and 0.001 as the starting learning rate, but I always get NaN results. Am I doing something wrong here? What might be causing this issue?

Thanks!

ZhanqiuHu commented 10 months ago

It seems that running torchrec.datasets.scripts.npy_preproc_criteo triggers RuntimeWarning: divide by zero encountered in log. Is there a workaround for that?
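
The warning itself is easy to reproduce in isolation. The sketch below assumes the dense column contains a zero at the point where the logarithm is taken (the array is made up, not taken from the Criteo data). It shows that NumPy returns -inf rather than NaN in that case, so a check that only looks for NaN values in the preprocessed files would come back clean.

```python
import warnings

import numpy as np

# Hypothetical dense column: raw counts that include a zero.
dense = np.array([0, 1, 10, 100], dtype=np.float32)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # log(0) is where NumPy reports "divide by zero encountered in log".
    logged = np.log(dense)

warning_messages = [str(w.message) for w in caught]
has_neg_inf = bool(np.isneginf(logged).any())  # the zero became -inf
has_nan = bool(np.isnan(logged).any())         # but nothing is NaN yet

print(warning_messages)
print(logged, has_neg_inf, has_nan)
```

Whether the TorchRec script hits exactly this case depends on the values in each dense column, which the sketch does not attempt to model.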

mnaumovfb commented 9 months ago

What happens when you run the test and bench script as shown in the documentation?

./test/dlrm_s_test.sh
./bench/dlrm_s_criteo_kaggle.sh --test-freq=1024

TomekWei commented 4 months ago

Hi, I also get NaN when running it in DLRCs with TorchRec. Did you solve it? I found that there are some -inf values in the Kaggle Criteo dataset. I'm not sure whether the torch team has handled this.
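
One way to confirm this kind of finding is to count all non-finite entries rather than only NaN. The following is a minimal sketch on a made-up array; summarize_non_finite is a hypothetical helper, not part of TorchRec. It also illustrates how a single -inf input can turn into NaN once it is multiplied by weights of opposite sign and summed.

```python
import numpy as np


def summarize_non_finite(dense: np.ndarray) -> dict:
    """Count NaN, +inf and -inf entries in a dense feature array."""
    return {
        "nan": int(np.isnan(dense).sum()),
        "pos_inf": int(np.isposinf(dense).sum()),
        "neg_inf": int(np.isneginf(dense).sum()),
    }


# Hypothetical slice of preprocessed dense features containing one -inf.
dense = np.array([[0.0, 1.5], [-np.inf, 2.0], [0.7, 3.1]], dtype=np.float32)
report = summarize_non_finite(dense)

# Toy stand-in for two hidden units with weights of opposite sign.
toy_weights = np.array([1.0, -1.0], dtype=np.float32)
with np.errstate(invalid="ignore"):
    hidden = dense[1, 0] * toy_weights  # [-inf, +inf]
    output = hidden.sum()               # -inf + inf is NaN

print(report)
print(hidden, output)
```

On real data the same helper could be applied to each array loaded with np.load before training starts.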

ZhanqiuHu commented 4 months ago

I think it is one preprocessing operation in the script that is causing the problem. I ended up using some custom preprocessing steps instead of torchrec.datasets.scripts.npy_preproc_criteo.

TomekWei commented 4 months ago

I'm also trying to do that. If you still have that script, would you mind sharing it with me? Thanks a lot for your response.

ZhanqiuHu commented 4 months ago

Sorry, I'm not working on this now, so I didn't keep a copy of the code. I remember using part of the torchrec.datasets.scripts.npy_preproc_criteo code to decode the text into values, which gave me a bunch of numpy files, and then normalizing the dense values. Hope this helps!
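
The exact normalization used here was not kept, so the following is only one plausible variant and not the author's code: clamp negative counts to zero and apply log(1 + x), which stays finite for every non-negative input. normalize_dense and the sample array are hypothetical.

```python
import numpy as np


def normalize_dense(raw: np.ndarray) -> np.ndarray:
    """Clamp negatives to zero, then apply log(1 + x) so every output is finite."""
    clipped = np.maximum(raw, 0)
    return np.log1p(clipped.astype(np.float32))


# Hypothetical decoded dense values, including negative and zero counts.
raw = np.array([[-2, 0, 5], [3, 1200, -1]], dtype=np.int64)
normalized = normalize_dense(raw)

print(normalized)
print(np.isfinite(normalized).all())
```

Clamping discards any distinction between negative values and zero, so this choice is only reasonable if negative counts carry no meaning for the model.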

TomekWei commented 4 months ago

It's ok. Thank you very much.