Hi, thank you for the great work.

I have a question about tp-overlap. The function below creates a buffer sized for `args.seq_length * args.micro_batch_size` tokens. Does this support the thd format?
```python
def _initialize_tp_communicators():
    """ initializing the communicators with user buffers for high-performance tensor-model-parallel
        communication overlap """

    try:
        import yaml

        import transformer_engine
        from transformer_engine.pytorch import module as te_module

    except ImportError:
        raise RuntimeError("Tensor Parallel Communication/GEMM Overlap optimization needs 'yaml' and "
                           "'transformer_engine' packages")

    args = get_args()

    if args.tp_comm_overlap_cfg is not None:
        with open(args.tp_comm_overlap_cfg, "r") as stream:
            ub_cfgs = yaml.safe_load(stream)
    else:
        ub_cfgs = {}

    input_shape = [(args.seq_length * args.micro_batch_size) // args.context_parallel_size, args.hidden_size]

    # We create a MPI process group, which is needed to bootstrap the pipelined
    # tensor-model-parallel communication overlap
    torch.distributed.new_group(backend='nccl')

    te_module.base.initialize_ub(shape=input_shape, tp_size=args.tensor_model_parallel_size,
                                 use_fp8=(args.fp8 is not None), ub_cfgs=ub_cfgs)
```
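For reference, here is a minimal sketch of what that buffer shape works out to, using hypothetical values for the relevant args (the numbers are only illustrative, not from any real config):

```python
# Hypothetical example values, chosen only to illustrate the arithmetic.
seq_length = 4096
micro_batch_size = 2
context_parallel_size = 1
hidden_size = 8192

# Same expression as in _initialize_tp_communicators: one row per local token,
# one column per hidden unit.
input_shape = [(seq_length * micro_batch_size) // context_parallel_size, hidden_size]
print(input_shape)  # [8192, 8192]
```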
Following up on this: I have found that after the TP/SP MLP layer, the output shape is exactly `[seqlen, args.hidden_size]`. So how does this buffer work for the qkv_proj output (`hidden_dim * 3 / tp_size`) and the MLP output (`hidden_dim * 2 / tp_size`)?
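To make the mismatch I am asking about concrete, here is the arithmetic as I understand it, with the same hypothetical numbers plus an assumed `tp_size` (this is only a sketch of my reading of the shapes, not a statement of how the library actually sizes its buffers):

```python
# Hypothetical numbers, continuing the sketch above; tp_size is also assumed.
tokens = (4096 * 2) // 1      # (seq_length * micro_batch_size) // context_parallel_size
hidden_size = 8192
tp_size = 8

# Output of the TP/SP MLP block as I observe it: matches the registered buffer shape.
mlp_block_out = (tokens, hidden_size)                 # (8192, 8192)

# Per-rank outputs of the column-parallel GEMMs as I understand them, which are
# narrower than the registered buffer:
qkv_proj_out = (tokens, 3 * hidden_size // tp_size)   # (8192, 3072)
mlp_fc1_out = (tokens, 2 * hidden_size // tp_size)    # (8192, 2048)

print(mlp_block_out, qkv_proj_out, mlp_fc1_out)
```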