facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

Multi-GPU training does not converge #377

Open liuuu6 opened 6 months ago

liuuu6 commented 6 months ago

I'm having trouble training a DLRM model with 4 GPUs. When running on the full dataset, the model reaches an AUC of 0.79 after 12 hours of training on 1 GPU, but only 0.76 when I use 4 GPUs for the same amount of time, and the loss curve swings widely. Here are the arguments I used:

```
torchrun --nproc_per_node=4 dlrm-embbag-sparse.py \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot="13-512-256-64" \
    --arch-mlp-top="512-512-256-1" \
    --max-ind-range=10000000 \
    --data-generation=dataset \
    --data-set=terabyte \
    --loss-function=bce \
    --round-targets=True \
    --learning-rate=0.1 \
    --mini-batch-size=2048 \
    --print-freq=2048 \
    --print-time \
    --test-mini-batch-size=16384 \
    --use-gpu \
    --dist-backend=nccl \
    --mlperf-logging \
    --test-freq=409600 \
    --processed-data-file=/criteo/preprocessed/ \
    --nepochs=1 \
    --memory-map \
    --mlperf-bin-shuffle \
    --mlperf-bin-loader \
    --raw-data-file=/criteo/preprocessed/day
```

(These parameters follow dlrm/bench/dlrm_s_criteo_terabyte.sh.)
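As an aside not raised in the thread: it is worth confirming whether --mini-batch-size is interpreted per rank or globally in the multi-GPU path, because a 4x larger effective batch with an unchanged learning rate is a common cause of an oscillating loss. Below is a minimal sketch of the usual linear-scaling-plus-warmup heuristic in plain PyTorch; the warmup length, scaling rule, and stand-in model are illustrative assumptions, not DLRM's actual schedule.

```python
# Illustrative sketch only: if each of the 4 ranks sees its own batch of 2048,
# the effective global batch is 4x larger, which often calls for a rescaled
# and/or warmed-up learning rate. Values below are assumptions, not DLRM defaults.
import torch

world_size = 4
base_lr = 0.1
warmup_steps = 1000  # assumed warmup length for illustration

# Linear-scaling heuristic: grow the LR with the growth in global batch size.
scaled_lr = base_lr * world_size

model = torch.nn.Linear(64, 1)  # stand-in for the real DLRM model
opt = torch.optim.SGD(model.parameters(), lr=scaled_lr)

def warmup(step: int) -> float:
    """Ramp the LR linearly from ~0 to scaled_lr over warmup_steps, then hold."""
    return min(1.0, (step + 1) / warmup_steps)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup)

for step in range(5):  # placeholder for the real training loop
    opt.step()
    sched.step()
    print(step, sched.get_last_lr())
```

If the script already splits the global --mini-batch-size across ranks, the scaling factor should stay at 1 and only the warmup part of the sketch is relevant.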

Below is the loss chart during my 4-GPU training:

[loss chart image]

hrwleo commented 6 months ago

I have received your email, thank you very much! If I do not reply in time, please call 15868848097 (QQ: 812737452).

Qinghe12 commented 5 months ago

@liuuu6 Could you tell me where you produced the processed data file? I have downloaded the Terabyte dataset (day_0 ... day_23), but preprocessing it takes too long (I used dlrm_s_criteo_terabyte.sh).
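A quick way to see what a preprocessing run produced (and whether it can be reused rather than regenerated) is to inspect the arrays it saved. The file name and keys below are guesses at the Criteo preprocessing output, not verified against the repository.

```python
# Hypothetical sanity check on a preprocessed Criteo shard. The file name
# "day_0_reordered.npz" and its contents are assumptions about what the
# preprocessing writes next to --processed-data-file; adjust to the files
# that actually exist on disk.
import numpy as np

with np.load("/criteo/preprocessed/day_0_reordered.npz") as shard:
    print(shard.files)  # names of the arrays stored in this shard
    for name in shard.files:
        arr = shard[name]
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
```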