NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

AssertionError: This version of c10d does not support no_copy option #1134

Open dajiji opened 3 years ago

dajiji commented 3 years ago

Hi apex team, can you please suggest how to work around the failed "c10d no_copy" assertion at https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_lamb.py#L140?

    assert ('no_copy' in inspect.getfullargspec(torch.distributed.reduce_scatter).args), "This version of c10d does not support no_copy option"
AssertionError: This version of c10d does not support no_copy option

Perhaps it was deprecated after NVIDIA's run this June. Which pip wheel from https://pytorch.org/ should I use? Or how can I get a PyTorch build with c10d no_copy support?
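
For reference, the assert only introspects the signature of the reduce_scatter wrapper in the installed PyTorch build. Here is the same check as a minimal standalone sketch (only torch needs to be importable; no process group has to be initialized):

    import inspect
    import torch.distributed as dist

    # Same introspection the apex assert performs: look for a 'no_copy'
    # argument in the signature of the c10d reduce_scatter wrapper.
    if 'no_copy' in inspect.getfullargspec(dist.reduce_scatter).args:
        print("this build exposes no_copy")
    else:
        print("this build does not expose no_copy")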

Here's how to reproduce it: run the NVIDIA MLPerf PyTorch BERT code with Python 3.9 + CUDA 11.1 + PyTorch 1.8 / 1.9 / 1.10-nightly + apex master. All combinations failed.

Command:

python \
/home/mlperfbert/run_pretraining.py \
--distributed_lamb \
--train_batch_size=48 \
--learning_rate=1.5e-3 \
--opt_lamb_beta_1=0.83 \
--opt_lamb_beta_2=0.925 \
--warmup_proportion=0.0 \
--warmup_steps=100 \
--start_warmup_step=-25 \
--max_steps=1271 \
--phase2 \
--max_seq_length=512 \
--max_predictions_per_seq=76 \
--do_train \
--skip_checkpoint \
--train_mlm_accuracy_window_size=0 \
--target_mlm_accuracy=0.720 \
--weight_decay_rate=0.01 \
--max_samples_termination=4500000 \
--eval_iter_start_samples=175000 \
--eval_iter_samples=175000 \
--eval_batch_size=16 \
--cache_eval_data \
--fp16 \
--fused_gelu_bias \
--fused_mha \
--dense_seq_output \
--unpad \
--unpad_fmha \
--exchange_padding \
--dwu-num-rs-pg=1 \
--dwu-num-ar-pg=1 \
--dwu-num-blocks=1 \
--gradient_accumulation_steps=1 \
--log_freq=0 \
--local_rank=0 \
--allreduce_post_accumulation \
--allreduce_post_accumulation_fp16 \
--seed=3627

Appreciate any suggestions.

HangJie720 commented 3 years ago

I want to ask about the same problem:

>>> import torch
>>> import inspect
>>> inspect.getfullargspec(torch.distributed.reduce_scatter).args
['output', 'input_list', 'op', 'group', 'async_op']

torch.distributed.reduce_scatter has no 'no_copy' parameter; the upstream code is here: https://github.com/pytorch/pytorch/blob/v1.5.0/torch/distributed/distributed_c10d.py#L1425
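
One possible workaround sketch (my assumption, not something apex provides) is to forward no_copy only when the installed reduce_scatter actually accepts it. As far as I can tell no upstream PyTorch release exposes that argument, so this only avoids the assertion rather than enabling the zero-copy path:

    import inspect
    import torch.distributed as dist

    def reduce_scatter_maybe_no_copy(output, input_list, group=None, async_op=False):
        # Hypothetical helper: pass no_copy=True only if the installed c10d
        # wrapper supports it; otherwise fall back to the plain call.
        kwargs = {}
        if 'no_copy' in inspect.getfullargspec(dist.reduce_scatter).args:
            kwargs['no_copy'] = True
        return dist.reduce_scatter(output, input_list, group=group,
                                   async_op=async_op, **kwargs)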