facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

Multi-GPU training does not converge #377

Open liuuu6 opened 6 months ago

liuuu6 commented 6 months ago

I'm having trouble training a DLRM model with 4 GPUs. When running on the full dataset, the model reaches an AUC of 0.79 after 12 hours of training on 1 GPU, but only 0.76 when I use 4 GPUs for the same amount of time, and the loss curve swings widely. Here are the arguments I used:

```
torchrun --nproc_per_node=4 dlrm-embbag-sparse.py \
    --arch-sparse-feature-size=64 \
    --arch-mlp-bot="13-512-256-64" \
    --arch-mlp-top="512-512-256-1" \
    --max-ind-range=10000000 \
    --data-generation=dataset \
    --data-set=terabyte \
    --loss-function=bce \
    --round-targets=True \
    --learning-rate=0.1 \
    --mini-batch-size=2048 \
    --print-freq=2048 \
    --print-time \
    --test-mini-batch-size=16384 \
    --use-gpu \
    --dist-backend=nccl \
    --mlperf-logging \
    --test-freq=409600 \
    --processed-data-file=/criteo/preprocessed/ \
    --nepochs=1 \
    --memory-map \
    --mlperf-bin-shuffle \
    --mlperf-bin-loader \
    --raw-data-file=/criteo/preprocessed/day
```

(These parameters follow dlrm/bench/dlrm_s_criteo_terabyte.sh.)
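As an aside not raised in the thread: it is worth confirming whether --mini-batch-size is interpreted per rank or globally in the multi-GPU path, because a 4x larger effective batch with an unchanged learning rate is a common cause of an oscillating loss. Below is a minimal sketch of the usual linear-scaling-plus-warmup heuristic in plain PyTorch; the warmup length, scaling rule, and stand-in model are illustrative assumptions, not DLRM's actual schedule.

```python
# Illustrative sketch only: if each of the 4 ranks sees its own batch of 2048,
# the effective global batch is 4x larger, which often calls for a rescaled
# and/or warmed-up learning rate. Values below are assumptions, not DLRM defaults.
import torch

world_size = 4
base_lr = 0.1
warmup_steps = 1000  # assumed warmup length for illustration

# Linear-scaling heuristic: grow the LR with the growth in global batch size.
scaled_lr = base_lr * world_size

model = torch.nn.Linear(64, 1)  # stand-in for the real DLRM model
opt = torch.optim.SGD(model.parameters(), lr=scaled_lr)

def warmup(step: int) -> float:
    """Ramp the LR linearly from ~0 to scaled_lr over warmup_steps, then hold."""
    return min(1.0, (step + 1) / warmup_steps)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup)

for step in range(5):  # placeholder for the real training loop
    opt.step()
    sched.step()
    print(step, sched.get_last_lr())
```

If the script already splits the global --mini-batch-size across ranks, the scaling factor should stay at 1 and only the warmup part of the sketch is relevant.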

Below is the loss chart during my 4-GPU training:

[loss chart image]

hrwleo commented 6 months ago

I have received your email, thank you very much! If I do not reply in time, please call 15868848097 (QQ: 812737452).

Qinghe12 commented 5 months ago

@liuuu6 Could you tell me where you produced the processed data file? I have downloaded the Terabyte dataset (day_0 ... day_23), but preprocessing it takes too long (I used dlrm_s_criteo_terabyte.sh).
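A quick way to see what a preprocessing run produced (and whether it can be reused rather than regenerated) is to inspect the arrays it saved. The file name and keys below are guesses at the Criteo preprocessing output, not verified against the repository.

```python
# Hypothetical sanity check on a preprocessed Criteo shard. The file name
# "day_0_reordered.npz" and its contents are assumptions about what the
# preprocessing writes next to --processed-data-file; adjust to the files
# that actually exist on disk.
import numpy as np

with np.load("/criteo/preprocessed/day_0_reordered.npz") as shard:
    print(shard.files)  # names of the arrays stored in this shard
    for name in shard.files:
        arr = shard[name]
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")
```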