facebookresearch / FAMBench

Benchmarks to capture important workloads.
Apache License 2.0
28 stars, 23 forks

Fix dlrm distributed_forward data corruption check. #81

Closed (samiwilf closed 1 year ago)

samiwilf commented 2 years ago

Tested the fix on a cluster of 8 EC2 p4d.24xlarge instances, running:

    python -m torch.distributed.launch \
        --nproc_per_node=8 --nnodes=8 --node_rank=0 \
        --master_addr="172.31.36.159" --master_port=8888 \
        dlrm_s_pytorch.py \
        --data-generation=dataset --data-set=terabyte \
        --loss-function=bce --round-targets=True \
        --learning-rate=0.1 \
        --mini-batch-size=128 --test-mini-batch-size=16384 \
        --print-freq=1024 --print-time --test-freq=30000 \
        --raw-data-file=/home/ubuntu/mountpoint/criteo_terabyte_subsample0.0_maxind40M/day \
        --processed-data-file=/home/ubuntu/mountpoint/criteo_terabyte_subsample0.0_maxind40M/ \
        --mlperf-logging --memory-map --use-gpu
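
The command above is the one for the first node (`--node_rank=0`). The comment does not show what was run on the other seven instances. Presumably the same command was launched on each with `--node_rank` set to 1 through 7, which is how `torch.distributed.launch` is normally used across nodes. The following is a minimal sketch of that assumption, not the author's actual tooling: the helper names `launch_command` and `world_size` are hypothetical, and `SCRIPT_ARGS` is deliberately shortened (the full flag list is in the command above).

```python
import shlex

# Values copied from the command in the comment above.
NNODES = 8
NPROC_PER_NODE = 8
MASTER_ADDR = "172.31.36.159"
MASTER_PORT = 8888

# Shortened for illustration; the full dlrm_s_pytorch.py flag list
# is in the original command.
SCRIPT_ARGS = [
    "dlrm_s_pytorch.py",
    "--data-generation=dataset",
    "--data-set=terabyte",
    "--use-gpu",
]


def launch_command(node_rank):
    """Build the torch.distributed.launch argv for one node (hypothetical helper)."""
    if not 0 <= node_rank < NNODES:
        raise ValueError(f"node_rank must be in [0, {NNODES}), got {node_rank}")
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={NPROC_PER_NODE}",
        f"--nnodes={NNODES}",
        f"--node_rank={node_rank}",
        f"--master_addr={MASTER_ADDR}",
        f"--master_port={MASTER_PORT}",
        *SCRIPT_ARGS,
    ]


def world_size():
    """Total number of processes the launcher starts across all nodes."""
    return NNODES * NPROC_PER_NODE


if __name__ == "__main__":
    # Print one command per node; only --node_rank differs between them.
    for rank in range(NNODES):
        print(shlex.join(launch_command(rank)))
    print(f"# total processes: {world_size()}")
```

With 8 nodes and 8 processes per node, the launcher starts 64 processes in total, one per GPU on a p4d.24xlarge.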