NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

create_communicator_grouped2 may trigger an uninitialized-value memory issue (random crash) when training for more iterations #959

Open anderson101866 opened 1 week ago

anderson101866 commented 1 week ago

Container:

nvcr.io/nvidia/pytorch:24.05-py3

Machine:

x86 host with A100 GPUs

Reproduce:

python -m torch.distributed.run --nproc-per-node=2 examples/pytorch/comm_gemm_overlap/ln_mlp_with_overlap.py --num-iters=1000

Whether it crashes depends on the garbage left in memory: at line 102, the operator= assignment first destroys the old std::function value of _alloc_copy_allgather, but that value was never constructed (the struct is allocated with malloc), so the destructor runs on indeterminate bytes. See also the definition of struct communicator:

struct communicator {
  ...
  std::function<void(void **, void *, size_t, ExtComm)> _alloc_copy_allgather;  // will not be initialized by malloc
  std::function<void(ExtComm)> _barrier;                                        // will not be initialized by malloc
  std::function<void(void *)> _free;                                            // will not be initialized by malloc
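
The underlying pattern can be reproduced in isolation: when a struct with std::function members is obtained from malloc, its constructors never run, so any later assignment through operator= first destroys whatever garbage occupies the member. A minimal standalone sketch of the pattern (the Holder type and names below are illustrative, not TransformerEngine code):

```cpp
#include <cstdlib>
#include <functional>
#include <new>

struct Holder {
  std::function<void()> callback;  // non-trivial member: needs its constructor to run
};

int main() {
  // BAD: malloc returns raw, indeterminate bytes and Holder's constructor never runs,
  // so 'callback' holds garbage. operator= first destroys that "old" value, and the
  // std::function destructor may follow a bogus pointer -- an intermittent crash.
  Holder *bad = static_cast<Holder *>(std::malloc(sizeof(Holder)));
  // bad->callback = [] {};  // undefined behavior: destroys an object that was never constructed

  // OK: construct the object in the raw storage first (placement new), so 'callback'
  // is a valid empty std::function before any assignment touches it.
  Holder *good = new (std::malloc(sizeof(Holder))) Holder();
  good->callback = [] {};  // safe: the previous value is a properly constructed, empty function
  good->callback();

  good->~Holder();
  std::free(good);
  std::free(bad);
  return 0;
}
```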

Hope this hint helps the investigation.

anderson101866 commented 4 days ago

The commit on this fork fixes the issue: https://github.com/denera/TransformerEngine/commit/7a9522bdbbe28d2682567ea450f10d87cc68d03a
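
For context, a fix for this class of bug typically keeps the malloc-based allocation but constructs the non-trivial members in place before anything assigns to them, and destroys them again before free. The sketch below only illustrates that idea with stand-in types; it is not taken from the linked commit:

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>
#include <new>

// Stand-ins for the real types; 'ExtComm' here is just a placeholder alias.
using ExtComm = int;

struct communicator_like {
  std::function<void(void **, void *, std::size_t, ExtComm)> _alloc_copy_allgather;
  std::function<void(ExtComm)> _barrier;
  std::function<void(void *)> _free;
};

int main() {
  auto *comm = static_cast<communicator_like *>(std::malloc(sizeof(communicator_like)));

  // Construct each non-trivial member in place so that later assignments via
  // operator= destroy a valid (empty) std::function instead of indeterminate bytes.
  new (&comm->_alloc_copy_allgather)
      std::function<void(void **, void *, std::size_t, ExtComm)>();
  new (&comm->_barrier) std::function<void(ExtComm)>();
  new (&comm->_free) std::function<void(void *)>();

  comm->_barrier = [](ExtComm) { /* collective barrier would go here */ };  // now safe

  // Mirror the manual construction with manual destruction before freeing.
  std::destroy_at(&comm->_alloc_copy_allgather);
  std::destroy_at(&comm->_barrier);
  std::destroy_at(&comm->_free);
  std::free(comm);
  return 0;
}
```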