Open ZhanqiuHu opened 10 months ago
It seems like running torchrec.datasets.scripts.npy_preproc_criteo triggers RuntimeWarning: divide by zero encountered in log. Is there a workaround for that?
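For context, NumPy emits this warning whenever np.log receives an exact zero, and the result for that element is -inf. The sketch below reproduces the warning on a toy array. It is not taken from the torchrec script; the helper name log_with_warning_check is hypothetical and only illustrates the mechanism.

```python
import warnings

import numpy as np


def log_with_warning_check(values):
    """Apply np.log and report whether NumPy emitted a RuntimeWarning.

    Hypothetical helper for illustration; not part of torchrec.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of printing it once and hiding repeats.
        warnings.simplefilter("always")
        result = np.log(np.asarray(values, dtype=np.float32))
    warned = any(issubclass(w.category, RuntimeWarning) for w in caught)
    return result, warned


# A zero in the input is enough to trigger the warning and produce -inf.
result, warned = log_with_warning_check([0.0, 1.0, 10.0])
print(result, warned)
```

If the warning appears during preprocessing, some values reaching the log are zero (or negative, which yields NaN with a separate "invalid value" warning), so the output arrays are worth inspecting before training.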
What happens when you run the test and bench scripts as shown in the documentation?
./test/dlrm_s_test.sh
./bench/dlrm_s_criteo_kaggle.sh --test-freq=1024
Hi, I also get NaN when I run it in DLRCs with TorchRec. Did you solve it? I found that there are some -inf values in the Kaggle Criteo dataset. I'm not sure whether the torch team has handled it.
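One way to confirm an observation like this is to count non-finite values in the preprocessed dense array. Note that np.isnan alone does not flag -inf, so a dataset can pass a NaN check and still contain infinities. The following is a minimal sketch on a toy array; count_non_finite is a hypothetical helper, and loading the real .npy files is left out because their names and layout are not given in this thread.

```python
import numpy as np


def count_non_finite(arr):
    """Count NaN, +inf and -inf entries in an array (hypothetical helper)."""
    arr = np.asarray(arr, dtype=np.float64)
    return {
        "nan": int(np.isnan(arr).sum()),
        "pos_inf": int(np.isposinf(arr).sum()),
        "neg_inf": int(np.isneginf(arr).sum()),
    }


# Toy stand-in for a preprocessed dense-feature array.
dense = np.array([[0.5, -np.inf], [np.nan, 2.0]])
counts = count_non_finite(dense)
print(counts)
```

Any nonzero count means the model will see non-finite inputs, which can turn the loss into NaN regardless of the learning rate.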
I think it is one preprocessing operation in the script that is causing the problem. I ended up using some custom preprocessing steps instead of torchrec.datasets.scripts.npy_preproc_criteo.
I'm also trying to do that. If you still have that script, would you mind sharing it with me? Thanks a lot for your response.
Sorry, I'm not working on this now, so I didn't keep a copy of the code. I remember I used some parts of the torchrec.datasets.scripts.npy_preproc_criteo code to decode the text to values and got a bunch of numpy files, and then normalized the dense values. Hope this helps!
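Since the original custom script was not kept, the following is only a sketch of what a normalization step for the dense values might look like. It is not the poster's code and not torchrec's implementation. It assumes that negative and non-finite dense values can be clamped to zero before a log(1 + x) transform; that choice, and the name normalize_dense, are assumptions made for illustration.

```python
import numpy as np


def normalize_dense(dense):
    """Map dense features to finite values with log(1 + x).

    Assumption: non-finite and negative inputs are replaced by 0 first,
    so the logarithm never sees a value that would give -inf or NaN.
    """
    dense = np.asarray(dense, dtype=np.float32)
    # Replace NaN and +/-inf left over from earlier steps with 0.
    dense = np.nan_to_num(dense, nan=0.0, posinf=0.0, neginf=0.0)
    # Clamp negative values to 0 so that 1 + x is always >= 1.
    dense = np.maximum(dense, 0.0)
    return np.log1p(dense)


# Toy input containing a negative value, a zero, and a NaN.
out = normalize_dense([[-2.0, 0.0], [np.nan, np.e - 1.0]])
print(out)
```

After a step like this, every entry is finite and non-negative, so the dense features alone should no longer introduce -inf or NaN into training.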
It's ok. Thank you very much.
Hello,
I'm running some training with the Kaggle Criteo dataset, and here is the command I ran:
The model hyperparameters I chose follow this example script. I'm getting NaN results for some iterations. The preprocessed dataset does not contain NaN values, and I have tried 0.1, 0.01, and 0.001 for the initial learning rate, but I always get NaN results. Is there something I'm doing wrong here? What might be the cause of this issue?
Thanks!