Describe the bug
The following function (in megatron.core.tensor_parallel.random) is called when we initialize the random seeds. I suspect that the behavior of this function does not match its docstring, no matter which of the following ways the seed is passed to it:
A single input seed shared by all ranks: in this case, all ranks end up with the same data_parallel_seed, i.e. the same "default state", while the docstring says: "default state: This is for data parallelism and is (...) different across different model paralle groups."
A single input seed per DP group (i.e. the seed already varies across DP ranks): in this case, both the DP rank and the TP rank change tensor_model_parallel_seed, i.e. the "tensor-model-parallel state", while the docstring says: "tensor-model-parallel state: This state is (...) the same across data parallel groups." A small numeric sketch after the quoted function below illustrates both cases.
def model_parallel_cuda_manual_seed(seed):
    """Initialize model parallel cuda seed.

    This function should be called after the model parallel is
    initialized. Also, no torch.cuda.manual_seed should be called
    after this function. Basically, this is replacement for that
    function.

    Two set of RNG states are tracked:
        default state: This is for data parallelism and is the same among a set of model
            parallel GPUs but different across different model paralle groups. This is used
            for example for dropout in the non-tensor-model-parallel regions.
        tensor-model-parallel state: This state is different among a set of model parallel
            GPUs, but the same across data parallel groups. This is used for example for
            dropout in model parallel regions.
    """
    # 2718 is just for fun and any POSITIVE value will work.
    offset = seed + 2718
    tensor_model_parallel_seed = offset + get_tensor_model_parallel_rank()
    # Data parallel gets the original seed.
    data_parallel_seed = seed

    initialize_rng_tracker()
    _CUDA_RNG_STATE_TRACKER.reset()
    # Set the default state.
    torch.cuda.manual_seed(data_parallel_seed)
    _CUDA_RNG_STATE_TRACKER.add(_DATA_PARALLEL_RNG_TRACKER_NAME, data_parallel_seed)

    # and model parallel state.
    _CUDA_RNG_STATE_TRACKER.add(_MODEL_PARALLEL_RNG_TRACKER_NAME, tensor_model_parallel_seed)

    expert_parallel_seed = (
        seed + 1024 + 100 * get_expert_model_parallel_rank() + get_tensor_model_parallel_rank()
    )
    _CUDA_RNG_STATE_TRACKER.add(_EXPERT_PARALLEL_RNG_TRACKER_NAME, expert_parallel_seed)
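To make the two cases above concrete, here is a minimal standalone sketch. It is not Megatron code: it only mirrors the arithmetic of the function quoted above, and compute_seeds, the hard-coded seed 1234, and the 2x2 rank loops are hypothetical, written just for this issue.

def compute_seeds(seed, tp_rank):
    # Same arithmetic as model_parallel_cuda_manual_seed, with the rank getter
    # replaced by an explicit tp_rank argument.
    offset = seed + 2718
    tensor_model_parallel_seed = offset + tp_rank
    data_parallel_seed = seed
    return data_parallel_seed, tensor_model_parallel_seed

# Case 1: one seed shared by all ranks (seed = 1234 everywhere).
for dp_rank in range(2):
    for tp_rank in range(2):
        default_state, tp_state = compute_seeds(1234, tp_rank)
        print(f"dp={dp_rank} tp={tp_rank} -> default={default_state} tp-state={tp_state}")
# The default state is 1234 on every rank, i.e. NOT different across model parallel groups.

# Case 2: the seed already differs per DP rank (e.g. seed = 1234 + 10 * dp_rank).
for dp_rank in range(2):
    for tp_rank in range(2):
        default_state, tp_state = compute_seeds(1234 + 10 * dp_rank, tp_rank)
        print(f"dp={dp_rank} tp={tp_rank} -> default={default_state} tp-state={tp_state}")
# The tensor-model-parallel state now changes with dp_rank as well, i.e. NOT the same
# across data parallel groups.

In case 1 the "default state" violates the first docstring guarantee; in case 2 the "tensor-model-parallel state" violates the second one.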
To Reproduce
N/A
Expected behavior
The docstring presumably describes the intended (and thus expected) behavior of this function.
Stack trace/logs
N/A
Environment (please complete the following information):
Megatron-LM commit ID c4d12e26b2dc25a2eab7da92e2ac30338c0ed3de
PyTorch version N/A
CUDA version N/A
NCCL version N/A
Proposed fix
N/A
Additional context
I suspect model_parallel_cuda_manual_seed is meant to take care of all the details of seeding under the different forms of parallelism. If that is true, then the following usage (in megatron/training/initialize.py) seems problematic:
def _set_random_seed(seed_, data_parallel_random_init=False):
    """Set random seed for reproducability."""
    if seed_ is not None and seed_ > 0:
        # Ensure that different pipeline MP stages get different seeds.
        seed = seed_ + (100 * mpu.get_pipeline_model_parallel_rank())
        # Ensure different data parallel ranks get different seeds
        if data_parallel_random_init:
            seed = seed + (10 * mpu.get_data_parallel_rank())
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if torch.cuda.device_count() > 0:
            tensor_parallel.model_parallel_cuda_manual_seed(seed)
    else:
        raise ValueError("Seed ({}) should be a positive integer.".format(seed_))
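Concretely, the combined effect of the two functions can be sketched as follows. This is again not Megatron code: effective_seeds and the toy 2x2x2 (PP x DP x TP) grid are hypothetical helpers that only reproduce the seed arithmetic quoted above.

def effective_seeds(seed_, pp_rank, dp_rank, tp_rank, data_parallel_random_init):
    # Seed adjustment done in _set_random_seed ...
    seed = seed_ + 100 * pp_rank
    if data_parallel_random_init:
        seed = seed + 10 * dp_rank
    # ... followed by the arithmetic of model_parallel_cuda_manual_seed.
    default_state = seed
    tp_state = seed + 2718 + tp_rank
    return default_state, tp_state

for flag in (False, True):
    print(f"data_parallel_random_init={flag}")
    for pp_rank in range(2):
        for dp_rank in range(2):
            for tp_rank in range(2):
                default_state, tp_state = effective_seeds(1234, pp_rank, dp_rank, tp_rank, flag)
                print(f"  pp={pp_rank} dp={dp_rank} tp={tp_rank} -> "
                      f"default={default_state} tp-state={tp_state}")

With data_parallel_random_init=False this reduces to case 1 above (the default state is identical across DP ranks), and with it set to True it reduces to case 2 (the tensor-model-parallel state differs across DP ranks). Either way, one of the two docstring guarantees quoted in the bug description appears not to hold.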