NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crash) when training for more iterations #959

Open anderson101866 opened 1 week ago

anderson101866 commented 1 week ago

Container:

nvcr.io/nvidia/pytorch:24.05-py3

Machine:

x86 host with A100 GPUs

Reproduce:

python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000

Whether it crashes depends on the garbage left in memory: at line 102, the operator= assignment first destroys the old std::function value of _alloc_copy_allgather, but that value was never constructed (the struct is allocated with malloc), so the destructor runs on indeterminate bytes. See also the definition of struct communicator:

struct communicator {
  ...
  std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather;  // will not be initialized by malloc
  std::function<void(ExtComm)> _barrier;                                        // will not be initialized by malloc
  std::function<void(void *)> _free;                                            // will not be initialized by malloc
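
The underlying pattern can be reproduced in isolation: when a struct with std::function members is obtained from malloc, its constructors never run, so any later assignment through operator= first destroys whatever garbage occupies the member. A minimal standalone sketch of the pattern (the Holder type and names below are illustrative, not TransformerEngine code):

```cpp
#include <cstdlib>
#include <functional>
#include <new>

struct Holder {
  std::function<void()> callback;  // non-trivial member: needs its constructor to run
};

int main() {
  // BAD: malloc returns raw, indeterminate bytes and Holder's constructor never runs,
  // so 'callback' holds garbage. operator= first destroys that "old" value, and the
  // std::function destructor may follow a bogus pointer -- an intermittent crash.
  Holder *bad = static_cast<Holder *>(std::malloc(sizeof(Holder)));
  // bad->callback = [] {};  // undefined behavior: destroys an object that was never constructed

  // OK: construct the object in the raw storage first (placement new), so 'callback'
  // is a valid empty std::function before any assignment touches it.
  Holder *good = new (std::malloc(sizeof(Holder))) Holder();
  good->callback = [] {};  // safe: the previous value is a properly constructed, empty function
  good->callback();

  good->~Holder();
  std::free(good);
  std::free(bad);
  return 0;
}
```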

Hope this hint helps the investigation.

anderson101866 commented 4 days ago

The commit on this fork fixes the issue: https://github.com/denera/TransformerEngine/commit/7a9522bdbbe28d2682567ea450f10d87cc68d03a
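
For context, a fix for this class of bug typically keeps the malloc-based allocation but constructs the non-trivial members in place before anything assigns to them, and destroys them again before free. The sketch below only illustrates that idea with stand-in types; it is not taken from the linked commit:

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>
#include <new>

// Stand-ins for the real types; 'ExtComm' here is just a placeholder alias.
using ExtComm = int;

struct communicator_like {
  std::function<void(void **, void *, std::size_t, ExtComm)> _alloc_copy_allgather;
  std::function<void(ExtComm)> _barrier;
  std::function<void(void *)> _free;
};

int main() {
  auto *comm = static_cast<communicator_like *>(std::malloc(sizeof(communicator_like)));

  // Construct each non-trivial member in place so that later assignments via
  // operator= destroy a valid (empty) std::function instead of indeterminate bytes.
  new (&comm->_alloc_copy_allgather)
      std::function<void(void **, void *, std::size_t, ExtComm)>();
  new (&comm->_barrier) std::function<void(ExtComm)>();
  new (&comm->_free) std::function<void(void *)>();

  comm->_barrier = [](ExtComm) { /* collective barrier would go here */ };  // now safe

  // Mirror the manual construction with manual destruction before freeing.
  std::destroy_at(&comm->_alloc_copy_allgather);
  std::destroy_at(&comm->_barrier);
  std::destroy_at(&comm->_free);
  std::free(comm);
  return 0;
}
```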