NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Do tp overlap support thd, whose sequence length is flexible? #1238

Closed · wplf closed this issue 1 month ago

wplf commented 1 month ago

Hi, thank you for the great work.

I have a question about tp-overlap. The function below allocates a buffer sized for args.seq_length * args.micro_batch_size tokens. Does this support the thd format, where the sequence length varies per micro-batch?

def _initialize_tp_communicators():
    """ initializing the communicators with user buffers for high-performance tensor-model-parallel
        communication overlap """

    try:
        import yaml

        import transformer_engine
        from transformer_engine.pytorch import module as te_module

    except ImportError:
        raise RuntimeError("Tensor Parallel Communication/GEMM Overlap optimization needs 'yaml' and "
                           "'transformer_engine' packages")

    args = get_args()

    if args.tp_comm_overlap_cfg is not None:
        with open(args.tp_comm_overlap_cfg, "r") as stream:
            ub_cfgs = yaml.safe_load(stream)
    else:
        ub_cfgs = {}

    input_shape = [(args.seq_length * args.micro_batch_size) // args.context_parallel_size, args.hidden_size]

    # We create an MPI process group, which is needed to bootstrap the pipelined
    # tensor-model-parallel communication overlap
    torch.distributed.new_group(backend='nccl')
    te_module.base.initialize_ub(shape=input_shape, tp_size=args.tensor_model_parallel_size,
                                 use_fp8=(args.fp8 is not None), ub_cfgs=ub_cfgs)
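
For context on the thd part of the question, here is a minimal sketch (my own illustration with made-up numbers, not Megatron-LM code) comparing the fixed buffer size implied by input_shape above with the token count of a thd-packed micro-batch; the cu_seqlens list below is a hypothetical packing I chose for the example:

# Illustration only: fixed userbuffer size vs. variable thd token count.
seq_length = 4096           # args.seq_length
micro_batch_size = 2        # args.micro_batch_size
context_parallel_size = 1   # args.context_parallel_size
hidden_size = 4096          # args.hidden_size

# Shape registered once by initialize_ub() above:
ub_tokens = (seq_length * micro_batch_size) // context_parallel_size
print("userbuffer shape:", [ub_tokens, hidden_size])   # [8192, 4096]

# A thd micro-batch packs variable-length sequences back to back, described
# by cumulative sequence lengths; the total token count changes every step.
cu_seqlens = [0, 1500, 3100, 5000, 6200]               # hypothetical packing
thd_tokens = cu_seqlens[-1]
print("thd tokens in this micro-batch:", thd_tokens)   # 6200, not 8192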

Following up on this question: I have found that after the TP/SP MLP layer, the output shape is exactly (seqlen, args.hidden_size). So how does this work for qkv_proj (hidden_dim * 3 / tp_size) and the MLP (hidden_dim * 2 / tp_size)?
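
To make the shapes in that question concrete, here is a rough sketch (my own numbers and my own understanding of the TP/SP activation shapes, not taken from the repository) for one layer, with s = seq_length, b = micro_batch_size, h = hidden_size, and tp = tensor_model_parallel_size:

# Illustration only: activation shapes under TP + sequence parallelism as I
# understand them; ffn is an example ffn_hidden_size.
s, b, h = 4096, 2, 4096     # seq_length, micro_batch_size, hidden_size
tp = 8                      # tensor_model_parallel_size
ffn = 4 * h                 # example ffn_hidden_size

# Sequence-parallel region: activations are split along the sequence dimension.
ln_out_scattered = (s // tp, b, h)    # layernorm output before all-gather
ln_out_gathered  = (s, b, h)          # after all-gather; compare to input_shape above

# Column-parallel GEMM outputs are split along the output (hidden) dimension.
qkv_out = (s, b, 3 * h // tp)         # the "qkv_proj hidden_dim * 3 / tp_size" case
fc1_out = (s, b, 2 * ffn // tp)       # the "mlp hidden_dim * 2 / tp_size" case (gated MLP)

# Row-parallel GEMM outputs return to hidden_size and are reduce-scattered.
proj_out           = (s, b, h)        # attention out_proj / MLP fc2 output
proj_out_scattered = (s // tp, b, h)  # back in the sequence-parallel region

# My question in shape terms: initialize_ub() only receives [s * b // cp, h],
# so how do the buffers relate to the 3*h/tp and 2*ffn/tp sized GEMM outputs?
print(qkv_out, fc1_out)               # (4096, 2, 1536) (4096, 2, 4096)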