Open liuuu6 opened 6 months ago
@liuuu6 Could you tell me how you produced the processed-data-file? I have downloaded the Terabyte dataset (day_0 ... day_23), but preprocessing it takes too long (I use dlrm_s_criteo_terabyte.sh).
I'm having trouble training a DLRM model with 4 GPUs. On the full dataset, the model reaches an AUC of 0.79 after 12 hours of training on 1 GPU, but only 0.76 after the same amount of time on 4 GPUs. The loss curve also swings widely. Here are the arguments I used:
```
torchrun --nproc_per_node=4 dlrm-embbag-sparse.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=2048 --print-time --test-mini-batch-size=16384 --use-gpu --dist-backend=nccl --mlperf-logging --test-freq=409600 --processed-data-file=/criteo/preprocessed/ --nepochs=1 --memory-map --mlperf-bin-shuffle --mlperf-bin-loader --raw-data-file=/criteo/preprocessed/day
```
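One thing worth checking (an assumption on my part, since it depends on how the script shards batches): if each of the 4 ranks consumes its own `--mini-batch-size=2048` batch, the effective global batch is 4x larger than in the 1-GPU run, and the learning rate may need to be scaled accordingly. A minimal sketch of that arithmetic:

```python
# Hypothetical sketch: effective batch size and linearly scaled
# learning rate when each rank processes its own mini-batch.
# Values taken from the torchrun command above.
world_size = 4          # --nproc_per_node=4
per_rank_batch = 2048   # --mini-batch-size=2048
base_lr = 0.1           # --learning-rate=0.1, tuned for the 1-GPU run

# If batches are per-rank, the global batch grows with world_size.
global_batch = per_rank_batch * world_size

# Linear scaling rule: grow the LR with the global batch size
# (a common heuristic, not something the DLRM script does for you).
scaled_lr = base_lr * world_size

print(f"global batch: {global_batch}, scaled lr: {scaled_lr}")
```

Whether linear scaling is appropriate here is a judgment call; the swinging loss could also point to the LR already being too high for the larger effective batch, so a warmup or smaller per-rank batch may be worth trying instead.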
(The above parameters follow dlrm/bench/dlrm_s_criteo_terabyte.sh.) Below is the loss chart from my 4-GPU training: