Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

NCCL when trying to train on 2 nodes #19544

Open waynemystir opened 9 months ago

waynemystir commented 9 months ago

Bug description

I am trying to run a very simple training script for 2 nodes and I always get this error:

Output:

```
(ve) root@442a8ba5c0c6:~/ptl# . wr.sh
Start fitting...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function getCvarBool)
442a8ba5c0c6:126284:126284 [0] NCCL INFO cudaDriverVersion 12030
442a8ba5c0c6:126284:126284 [0] NCCL INFO Bootstrap : Using eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126284 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
442a8ba5c0c6:126284:126284 [0] NCCL INFO init.cc:1627 Cuda Host Alloc Size 4 pointer 0x7f1756e00000
442a8ba5c0c6:126284:126391 [0] NCCL INFO Failed to open libibverbs.so[.1]
442a8ba5c0c6:126284:126391 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.0.2<0>
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using non-device net plugin version 0
442a8ba5c0c6:126284:126391 [0] NCCL INFO Using network Socket

442a8ba5c0c6:126284:126391 [0] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:565 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO misc/socket.cc:619 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO bootstrap.cc:274 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO init.cc:1388 -> 2
442a8ba5c0c6:126284:126391 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:418 -> 2
442a8ba5c0c6:126284:126284 [0] NCCL INFO group.cc:95 -> 2
[rank1]:[W Module.cpp:156] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

Traceback (most recent call last):
  File "/root/ptl/wrain.py", line 68, in <module>
    main()
  File "/root/ptl/wrain.py", line 62, in main
    trainer.fit(model, train_dataloaders=train_data)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 943, in _run
    self.__setup_profiler()
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1078, in __setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1248, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/root/ptl/ve/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 307, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/ptl/ve/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to 192.168.240.2<39817> failed : Software caused connection abort
Exception raised from getNCCLComm at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691 (most recent call first):
C++ CapturedTraceback:
#4 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
#5 c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) [clone .cold] from ProcessGroupNCCL.cpp:0
#6 c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from ??:0
#7 c10d::ops::(anonymous namespace)::broadcast_CUDA(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from Ops.cpp:0
#8 std::decay<c10::guts::infer_function_traits<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> > >::type::return_type>::type c10::impl::call_functor_with_args_from_stack_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>(c10::OperatorKernel*, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*, std::integer_sequence<unsigned long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul>, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long>*) from :0
#9 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (*)(c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long> >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from :0
#10 torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from autograd_not_implemented_fallback.cpp:0
#11 void c10::BoxedKernel::make_boxed_function<&(anonymous namespace)::autograd_fallback>(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) from VariableFallbackKernel.cpp:0
#12 c10::impl::BoxedKernelWrapper<std::tuple<std::vector<at::Tensor, std::allocator<at::Tensor> >, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > (c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long), void>::call(c10::BoxedKernel const&, c10::OperatorHandle const&, c10::DispatchKeySet, c10::ArrayRef<at::Tensor>, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, long, long, bool, long) from :0
#13 c10d::ProcessGroup::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) from :0
#14 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::*)(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (*)(c10d::ProcessGroup*, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0
#15 pybind11::cpp_function::dispatcher(_object*, _object*, _object*) from :0
#16 PyObject_CallFunctionObjArgs from ??:0
#17 _PyObject_MakeTpCall from ??:0
#18 PyMethod_New from ??:0
#19 _PyEval_EvalFrameDefault from ??:0
#20 _PyFunction_Vectorcall from ??:0
#21 PyObject_Call from ??:0
#22 _PyEval_EvalFrameDefault from ??:0
#23 _PyFunction_Vectorcall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyFunction_Vectorcall from ??:0
#26 PyObject_Call from ??:0
#27 _PyEval_EvalFrameDefault from ??:0
#28 _PyFunction_Vectorcall from ??:0
#29 _PyEval_EvalFrameDefault from ??:0
#30 _PyFunction_Vectorcall from ??:0
#31 _PyEval_EvalFrameDefault from ??:0
#32 PyObject_IsTrue from ??:0
#33 _PyObject_GenericGetAttrWithDict from ??:0
#34 PyObject_GetAttr from ??:0
#35 _PyEval_EvalFrameDefault from ??:0
#36 _PyFunction_Vectorcall from ??:0
#37 _PyEval_EvalFrameDefault from ??:0
#38 PyMethod_New from ??:0
#39 _PyEval_EvalFrameDefault from ??:0
#40 PyMethod_New from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyMethod_New from ??:0
#43 PyObject_Call from ??:0
#44 _PyEval_EvalFrameDefault from ??:0
#45 _PyFunction_Vectorcall from ??:0
#46 _PyEval_EvalFrameDefault from ??:0
#47 PyMethod_New from ??:0
#48 _PyEval_EvalFrameDefault from ??:0
#49 _PyFunction_Vectorcall from ??:0
#50 _PyEval_EvalFrameDefault from ??:0
#51 PyEval_EvalCode from ??:0
#52 PyEval_EvalCode from ??:0
#53 PyUnicode_Tailmatch from ??:0
#54 PyInit__collections from ??:0
#55 PyUnicode_Tailmatch from ??:0
#56 _PyRun_SimpleFileObject from ??:0
#57 _PyRun_AnyFileObject from ??:0
#58 Py_RunMain from ??:0
#59 Py_BytesMain from ??:0
#60 __libc_init_first from ??:0
#61 __libc_start_main from ??:0
#62 _start from ??:0
```

What version are you seeing the problem on?

v2.2

How to reproduce the bug

I'm trying to run a very simple script for 2 nodes:

```bash
export MASTER_ADDR=194.26.196.214
export MASTER_PORT=31421
export WORLD_SIZE=2
export NODE_RANK=1
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1

export NCCL_P2P_DISABLE=1
export NCCL_P2P_LEVEL=NVL
export NCCL_IB_GID_INDEX=3
python train.py
```

Where train.py is this:

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x): 
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        print("batch_idx=", batch_idx)
        loss = self(batch).sum()
        self.log(
            "train_loss",
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

def main():
    train_data = DataLoader(RandomDataset(32, 64000), batch_size=8000)
    model = BoringModel()
    trainer = Trainer(
        devices=1,
        num_nodes=2,
        accelerator="gpu",
        strategy="ddp",
        limit_train_batches=100000,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        log_every_n_steps=1,
    )   
    print(f"Start fitting...")
    trainer.fit(model, train_dataloaders=train_data)

    print("DONE")

if __name__ == "__main__":
    main()
```

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

p208p2002 commented 9 months ago

Did you set a different NODE_RANK on each node? I currently run multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any environment variables (maybe they are set by the cluster system).
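As a quick check (just a sketch, independent of any cluster system), you could print the rendezvous variables at the top of train.py on both machines and compare them; NODE_RANK must differ between the nodes, while MASTER_ADDR, MASTER_PORT and WORLD_SIZE must match:

```python
import os

# Variables Lightning/torch.distributed use to form the process group.
# NODE_RANK should be 0 on the first node and 1 on the second; the other
# three must be identical on both nodes, and the address must be reachable.
for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "NODE_RANK"):
    print(f"{var}={os.environ.get(var, '<unset>')}")
```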

jamiesalter commented 1 month ago

@p208p2002 can you share how you set up Azure to work with Lightning across multiple nodes? I have Azure working on multiple GPUs on a single node with DDP, but not across multiple nodes. Would love to see how you did it!

p208p2002 commented 1 month ago

Sure, but please note that what I use is a compute cluster from Azure ML Studio.

First, you should create the compute cluster under AML.

Then create a project-specific environment; it's recommended to base it on DeepSpeed's official Docker image.
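For example, registering such an environment with the AML SDK could look roughly like this (just a sketch; the environment name and image reference are illustrative, not my actual setup):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Build the AML environment on top of a DeepSpeed-ready Docker image
# (replace the image reference with whatever base image you actually use).
env = Environment(
    name="deepspeed-training-env",       # illustrative name
    image="deepspeed/deepspeed:latest",  # illustrative image reference
    description="Project environment based on a DeepSpeed Docker image",
)
ml_client.environments.create_or_update(env)
```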

Next, write a job-submission script with the AML Python SDK; through the SDK you can specify which runtime to use and how to start the training program.

I can show you part of my job-submission script:


```python
from azure.ai.ml import command, Output, MLClient, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

...

job = command(
    display_name=...,
    code="./",  # local path where the code is stored
    environment_variables={
        "TRAINING_CONFIG": TRAINING_CONFIG,
        "WANDB_API_KEY": WANDB_API_KEY,
        "HF_TOKEN": HF_TOKEN,
    },
    command="python sft_trainer.py --num_nodes ${{inputs.num_nodes}} --gpus_per_node ${{inputs.gpus_per_node}} --log_dir ${{outputs.log_dir}} fit",
    inputs=inputs,
    outputs=outputs,
    environment=os.environ["AZURE_ENVIRONMENT"],
    compute=os.environ["AZURE_COUPUTE"],
    instance_count=inputs["num_nodes"],
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": inputs["gpus_per_node"],
    },
)

# submit the command
ml_client.jobs.create_or_update(job)
```

The code above is not complete; you will need to finish it yourself.

You can see that in `command` I pass the arg `--num_nodes` to sft_trainer.py, which is later passed to Lightning's Trainer. In `distribution`, `type` and `process_count_per_instance` are set; also set `instance_count` greater than 1 for multi-node training.
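For illustration, the argument handling in sft_trainer.py could look roughly like this (a sketch, not my exact script; the real one wires in more options behind the trailing `fit` subcommand):

```python
import argparse

# Hypothetical argument handling; the real script may differ.
parser = argparse.ArgumentParser()
parser.add_argument("--num_nodes", type=int, default=1)
parser.add_argument("--gpus_per_node", type=int, default=1)
parser.add_argument("--log_dir", type=str, default="outputs")
parser.add_argument("subcommand", choices=["fit"])  # the trailing `fit` in the job command
args = parser.parse_args()
```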

Finally, make a small modification to the Lightning Trainer:

```python
# trainer
trainer = Trainer(
    num_nodes=args.num_nodes,
    devices=args.gpus_per_node,
    ...
)
```

The last thing you should understand is how the distributed training system works. In short, the system provides environment variables that identify the master and worker nodes so that they can communicate with each other.
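If you want to verify that the two nodes can actually reach each other before involving NCCL, here is a minimal smoke test (a sketch; it assumes one process per node and reuses the MASTER_ADDR/MASTER_PORT/WORLD_SIZE/NODE_RANK variables from the report above):

```python
import os
import torch.distributed as dist

# One process per node: reuse NODE_RANK as the global rank.
rank = int(os.environ["NODE_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# "gloo" runs over plain TCP, so this checks reachability without NCCL.
dist.init_process_group("gloo", init_method="env://", rank=rank, world_size=world_size)
print(f"rank {rank}/{world_size} joined the process group")
dist.barrier()
dist.destroy_process_group()
```

Run it on both nodes at the same time; if it hangs or fails, the problem is networking (firewall, NAT, wrong interface) rather than Lightning.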

This article may help: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-distributed-gpu?view=azureml-api-2

jamiesalter commented 1 month ago

Thanks @p208p2002, I got it all working on my cluster. I haven't experimented with DeepSpeed yet, but that looks like an interesting avenue for speed-ups. Not sure my model is large enough to warrant it though (millions, not billions, of params for mobile).