facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

fail to run dlrm_s_pytorch.py on single node multiple GPUs with nccl #359

Open YuxinxinChen opened 11 months ago

YuxinxinChen commented 11 months ago

Hi Team,

I am able to run `python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --use-gpu`. But when I try to run dlrm_s_pytorch.py on a single node with multiple GPUs using NCCL, it fails. Here is the command I used:

python -m torch.distributed.launch --nproc_per_node=2 dlrm_s_pytorch.py --arch-embedding-size="80000-80000-80000-80000-80000-80000-80000-80000" --arch-sparse-feature-size=64 --arch-mlp-bot="128-128-128-128" --arch-mlp-top="512-512-512-256-1" --max-ind-range=40000000 --data-generation=random --loss-function=bce --round-targets=True --learning-rate=1.0 --mini-batch-size=2048 --print-freq=2 --print-time --test-freq=2 --test-mini-batch-size=2048 --memory-map --use-gpu --num-batches=100 --dist-backend=nccl

I got the following errors:

pytorch2.0.0/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Unable to import onnx.  No module named 'onnx'
usage: dlrm_s_pytorch.py [-h] [--arch-sparse-feature-size ARCH_SPARSE_FEATURE_SIZE] [--arch-embedding-size ARCH_EMBEDDING_SIZE] [--arch-mlp-bot ARCH_MLP_BOT] [--arch-mlp-top ARCH_MLP_TOP] [--arch-interaction-op {dot,cat}]
                         [--arch-interaction-itself] [--weighted-pooling WEIGHTED_POOLING] [--md-flag] [--md-threshold MD_THRESHOLD] [--md-temperature MD_TEMPERATURE] [--md-round-dims] [--qr-flag] [--qr-threshold QR_THRESHOLD]
                         [--qr-operation QR_OPERATION] [--qr-collisions QR_COLLISIONS] [--activation-function ACTIVATION_FUNCTION] [--loss-function LOSS_FUNCTION] [--loss-weights LOSS_WEIGHTS] [--loss-threshold LOSS_THRESHOLD]
                         [--round-targets ROUND_TARGETS] [--data-size DATA_SIZE] [--num-batches NUM_BATCHES] [--data-generation DATA_GENERATION] [--rand-data-dist RAND_DATA_DIST] [--rand-data-min RAND_DATA_MIN]
                         [--rand-data-max RAND_DATA_MAX] [--rand-data-mu RAND_DATA_MU] [--rand-data-sigma RAND_DATA_SIGMA] [--data-trace-file DATA_TRACE_FILE] [--data-set DATA_SET] [--raw-data-file RAW_DATA_FILE]
                         [--processed-data-file PROCESSED_DATA_FILE] [--data-randomize DATA_RANDOMIZE] [--data-trace-enable-padding DATA_TRACE_ENABLE_PADDING] [--max-ind-range MAX_IND_RANGE]
                         [--data-sub-sample-rate DATA_SUB_SAMPLE_RATE] [--num-indices-per-lookup NUM_INDICES_PER_LOOKUP] [--num-indices-per-lookup-fixed NUM_INDICES_PER_LOOKUP_FIXED] [--num-workers NUM_WORKERS] [--memory-map]
                         [--mini-batch-size MINI_BATCH_SIZE] [--nepochs NEPOCHS] [--learning-rate LEARNING_RATE] [--print-precision PRINT_PRECISION] [--numpy-rand-seed NUMPY_RAND_SEED] [--sync-dense-params SYNC_DENSE_PARAMS]
                         [--optimizer OPTIMIZER] [--dataset-multiprocessing] [--inference-only] [--quantize-mlp-with-bit QUANTIZE_MLP_WITH_BIT] [--quantize-emb-with-bit QUANTIZE_EMB_WITH_BIT] [--save-onnx] [--use-gpu]
                         [--local_rank LOCAL_RANK] [--dist-backend DIST_BACKEND] [--print-freq PRINT_FREQ] [--test-freq TEST_FREQ] [--test-mini-batch-size TEST_MINI_BATCH_SIZE] [--test-num-workers TEST_NUM_WORKERS] [--print-time]
                         [--print-wall-time] [--debug-mode] [--enable-profiling] [--plot-compute-graph] [--tensor-board-filename TENSOR_BOARD_FILENAME] [--save-model SAVE_MODEL] [--load-model LOAD_MODEL] [--mlperf-logging]
                         [--mlperf-acc-threshold MLPERF_ACC_THRESHOLD] [--mlperf-auc-threshold MLPERF_AUC_THRESHOLD] [--mlperf-bin-loader] [--mlperf-bin-shuffle] [--mlperf-grad-accum-iter MLPERF_GRAD_ACCUM_ITER]
                         [--lr-num-warmup-steps LR_NUM_WARMUP_STEPS] [--lr-decay-start-step LR_DECAY_START_STEP] [--lr-num-decay-steps LR_NUM_DECAY_STEPS]
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=1
Unable to import onnx.  No module named 'onnx'
[usage message repeated for the second worker process]
dlrm_s_pytorch.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 375622) of binary: /home/xxx/.conda/envs/torch2.0/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xxx/pkg/pytorch2.0.0/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
dlrm_s_pytorch.py FAILED

I used the command listed in the README.md. I am wondering if that is no longer the correct command (and if so, what is the right command to run), or could you tell me more about what I did wrong?

Thanks in advance! Best, Yuxin

mnaumovfb commented 9 months ago

Can you try the workaround suggested in the error message? In other words, rather than using `args.local_rank` here, try printing and passing along `os.environ['LOCAL_RANK']`.
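A minimal sketch of that workaround, assuming the script defines a `--local_rank` argument via argparse (as the usage message above shows): since PyTorch 2.0's launcher behaves like `torchrun` and exports the rank in the `LOCAL_RANK` environment variable instead of passing `--local-rank` on the command line, read the environment variable first and fall back to the CLI argument. The parser wiring below is illustrative, not the exact code in `dlrm_s_pytorch.py`.

```python
import argparse
import os

# Illustrative parser: dlrm_s_pytorch.py registers --local_rank (underscore),
# but torchrun passes --local-rank / sets LOCAL_RANK, causing the mismatch.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args, _unknown = parser.parse_known_args()

# Prefer the environment variable set by torchrun (--use-env behavior);
# fall back to the legacy --local_rank CLI argument if it is absent.
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
print(f"local_rank = {local_rank}")
```

With this fallback in place, the same script works whether it is launched by `torchrun` (which only sets `LOCAL_RANK`) or by the older `torch.distributed.launch` (which passes the rank as a CLI argument).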