facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

how to train dlrm with multi-gpu #354

Open DONGDILLON opened 1 year ago

DONGDILLON commented 1 year ago

Hi, I have recently been trying to train DLRM on 8 GPUs. The command I use is:

python3 -m torch.distributed.launch --nproc_per_node 4 python3 dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./input/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --dist-backend=nccl

However, it cannot establish a connection between the GPUs. Please help.
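
One likely issue in the command above: torch.distributed.launch already spawns the worker processes with the Python interpreter, so the training script should follow the launcher arguments directly (without a second python3), and --nproc_per_node should match the number of GPUs in use (8 here). Keeping the flags exactly as posted, a corrected form of the same launch would look roughly like:

python3 -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py --arch-sparse-feature-size=64 --arch-mlp-bot="13-512-256-64" --arch-mlp-top="512-512-256-1" --max-ind-range=10000000 --data-generation=dataset --data-set=terabyte --raw-data-file=./input/day --processed-data-file=./input/terabyte_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=2048 --print-freq=1024 --print-time --test-mini-batch-size=16384 --test-num-workers=16 --use-gpu --dist-backend=nccl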

mnaumovfb commented 9 months ago

How many GPUs do you have on the machine? Can you try the command from the README (Benchmarking, Section 5, "The code now supports synchronous distributed training ...") and share the error message?

# for a single node with 8 GPUs, nccl backend, and a randomly generated dataset:
python -m torch.distributed.launch --nproc_per_node=8 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000 --data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl
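
If that README command also fails to set up communication, it may help to verify NCCL connectivity on the node independently of DLRM. Below is a minimal sketch (the script name check_nccl.py and the launch line are illustrative only, not part of this repository): it joins a NCCL process group on each GPU and runs a single all-reduce.

# check_nccl.py -- minimal NCCL all-reduce sanity check (illustrative, not part of DLRM)
import argparse
import os
import torch
import torch.distributed as dist

if __name__ == "__main__":
    # torch.distributed.launch passes --local_rank (newer versions also set LOCAL_RANK)
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(os.environ.get("LOCAL_RANK", 0)))
    args = parser.parse_args()

    # bind this process to its GPU and join the NCCL process group
    # (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE are set by the launcher)
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # every rank contributes its rank id; the all-reduce sums them across all GPUs
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {t.item()}")

    dist.destroy_process_group()

Launched, for example, with python -m torch.distributed.launch --nproc_per_node=8 check_nccl.py. If this hangs or reports NCCL errors, the problem is in the GPU driver / NCCL setup rather than in dlrm_s_pytorch.py, and running with NCCL_DEBUG=INFO usually makes the failing step visible.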