jspark1105 opened 1 year ago
> Sorry for the late reply. Yes, that's correct. Currently we expect that this all-reduce happens outside of TE, which allows us to coalesce multiple all-reduces into a single NCCL call.
Hi @timmoon10, is this still valid? Do we still need to handle the all-reduce outside of TE?
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/layernorm_linear.py#L461-L471
When we use sequence parallelism, do we still need to all-reduce the norm weight gradients across the TP group after the code above?
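For reference, here is a minimal sketch of the kind of external all-reduce being discussed, assuming a PyTorch setup where `tp_group` is the tensor-parallel process group and `modules` is a list of TE modules such as `LayerNormLinear` (the `layer_norm_weight` / `layer_norm_bias` attribute names follow that module; everything else here is hypothetical). Flattening the gradients into one buffer is one way to get the coalescing into a single NCCL call that the quoted reply mentions:

```python
import torch
import torch.distributed as dist

def allreduce_layernorm_grads(modules, tp_group):
    """Sketch: all-reduce LayerNorm weight/bias gradients across the
    tensor-parallel group, outside of TE, after the backward pass."""
    grads = []
    for module in modules:
        for name in ("layer_norm_weight", "layer_norm_bias"):
            param = getattr(module, name, None)
            if param is not None and param.grad is not None:
                grads.append(param.grad.data)
    if not grads:
        return
    # Coalesce: concatenate into one flat buffer so a single NCCL
    # all-reduce covers all gradients (assumes they share a dtype).
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat, group=tp_group)
    # Copy the reduced values back into the individual gradient tensors.
    offset = 0
    for g in grads:
        g.copy_(flat[offset : offset + g.numel()].view_as(g))
        offset += g.numel()
```

With sequence parallelism, each TP rank only sees its shard of the sequence, so each rank's norm weight gradient is a partial sum; summing them across the TP group as above would produce the full gradient.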