NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] LM head weights get untied while training with overlap #656

Closed: mayank31398 closed this issue 5 months ago

mayank31398 commented 7 months ago

The LM head weights get untied during training even though they are supposed to stay tied to the input embedding. This happens when the overlap options are set to true.
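For context, "tied" here means the LM head reuses the embedding matrix. A minimal single-GPU sketch of the invariant that should hold (toy code, not Megatron internals):

    import torch
    import torch.nn as nn

    class TinyTiedLM(nn.Module):
        """Toy LM whose output projection reuses the input embedding matrix."""
        def __init__(self, vocab_size=16, hidden=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)

        def forward(self, token_ids):
            h = self.embed(token_ids)            # [batch, seq, hidden]
            return h @ self.embed.weight.t()     # logits computed with the tied weight

    # With true tying there is a single tensor, so the two uses can never diverge.
    # With pipeline parallelism the first and last stage hold separate copies that
    # must be kept in sync by all-reducing their gradients every step.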

cc: @deepakn94

deepakn94 commented 7 months ago

Can you provide an example script? And how are you inspecting the parameters?

mayank31398 commented 7 months ago

@deepakn94 I used 4 GPUs with a 3B param model. I added the following statements to the train_step function to print the tensors:

    # embedding weight on the first pipeline stage
    if mpu.get_tensor_model_parallel_rank() == 0:
        if mpu.is_pipeline_first_stage():
            print(model[0].module.module.language_model.embedding.word_embeddings.weight)
    torch.distributed.barrier()
    # tied LM head copy on the last pipeline stage
    if mpu.get_tensor_model_parallel_rank() == 0:
        if mpu.is_pipeline_last_stage():
            print(model[0].module.module.word_embeddings.weight)
    torch.distributed.barrier()
    print("-" * 50)
    torch.distributed.barrier()

Note that this only happens when pipeline parallelism is enabled. I started seeing this after the overlap of the backward pass with gradient communication was introduced in this repo.
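A lighter-weight check than printing full tensors, in case it helps: print a fingerprint of each copy and compare. This is just a sketch reusing the calls from the snippet above; the attribute paths may differ across versions.

    def tied_weight_fingerprint(model):
        # embedding copy on the first stage, tied LM head copy on the last stage
        if mpu.is_pipeline_first_stage():
            w = model[0].module.module.language_model.embedding.word_embeddings.weight
        elif mpu.is_pipeline_last_stage():
            w = model[0].module.module.word_embeddings.weight
        else:
            return
        if mpu.get_tensor_model_parallel_rank() == 0:
            # a crude fingerprint; identical copies must agree exactly
            print(f"rank {torch.distributed.get_rank()}: "
                  f"sum={w.double().sum().item():.10f} norm={w.double().norm().item():.10f}")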

You can use the script below to reproduce; I'm also attaching logs.

# A100 80GB

export NCCL_SOCKET_IFNAME="ib,bond"
export NCCL_IB_CUDA_SUPPORT=1
export NCCL_IB_PCI_RELAXED_ORDERING=1
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_NTHREADS=2
export NCCL_NSOCKS_PERTHREAD=4
export CUDA_DEVICE_MAX_CONNECTIONS=1

MASTER_ADDR=$(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | head -n 1)
MASTER_PORT=5${LSB_JOBID: -5:-1}
NNODES=$(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | sed 'n; d' | wc -w)
GPUS_PER_NODE=$(echo $CUDA_VISIBLE_DEVICES | tr ',' '\n' | wc -w)
NODE_RANK=$(($(echo ${LSB_MCPU_HOSTS} | tr ' ' '\n' | sed 'n; d' | grep -n -m1 $HOSTNAME | cut -d':' -f1)-1))

DISTRIBUTED_ARGS="\
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
"

GPT_ARGS="--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--num-layers 32 \
--hidden-size 3072 \
--num-attention-heads 32 \
--init-method-std 0.01275 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 2 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 510000 \
--lr-decay-iters 510000 \
--lr-decay-style constant \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--log-interval 10 \
--save-interval 2000 \
--eval-interval 5000000000 \
--eval-iters 2 \
--use-distributed-optimizer \
--tokenizer-type NullTokenizer \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-cache-path ./cache \
--sequence-parallel \
--distributed-timeout-minutes 120 \
--finetune \
--vocab-size 49152"

torchrun $DISTRIBUTED_ARGS \
    pretrain_gpt.py \
    $GPT_ARGS \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path /dataset/bluepile/g20bc_starcoder_tokens2_megatron/lang=Python
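For what it's worth, if you don't have LSF, a single-node 4-GPU launch should reduce to something like this (keeping the same GPT_ARGS; the data path is a placeholder for your own Megatron-preprocessed dataset):

    # single node, 4 GPUs, no LSF environment variables needed
    torchrun --nproc_per_node 4 --nnodes 1 --node_rank 0 \
        --master_addr localhost --master_port 6000 \
        pretrain_gpt.py \
        $GPT_ARGS \
        --save $CHECKPOINT_PATH \
        --load $CHECKPOINT_PATH \
        --data-path <your-megatron-preprocessed-dataset-prefix>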

At the beginning, the tensors are (they match exactly across all four ranks):

tensor([[ 0.0103, -0.0074,  0.0206,  ...,  0.0034, -0.0028, -0.0205],
        [ 0.0076, -0.0069, -0.0001,  ...,  0.0129,  0.0171, -0.0015],
        [-0.0142, -0.0120, -0.0104,  ..., -0.0024,  0.0121, -0.0005],
        ...,
        [ 0.0126, -0.0068,  0.0016,  ..., -0.0097,  0.0049,  0.0047],
        [ 0.0025,  0.0100, -0.0010,  ..., -0.0078, -0.0209, -0.0128],
        [-0.0222,  0.0206, -0.0101,  ..., -0.0168,  0.0177,  0.0025]],
       device='cuda:1', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([[ 0.0103, -0.0074,  0.0206,  ...,  0.0034, -0.0028, -0.0205],
        [ 0.0076, -0.0069, -0.0001,  ...,  0.0129,  0.0171, -0.0015],
        [-0.0142, -0.0120, -0.0104,  ..., -0.0024,  0.0121, -0.0005],
        ...,
        [ 0.0126, -0.0068,  0.0016,  ..., -0.0097,  0.0049,  0.0047],
        [ 0.0025,  0.0100, -0.0010,  ..., -0.0078, -0.0209, -0.0128],
        [-0.0222,  0.0206, -0.0101,  ..., -0.0168,  0.0177,  0.0025]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([[ 0.0103, -0.0074,  0.0206,  ...,  0.0034, -0.0028, -0.0205],
        [ 0.0076, -0.0069, -0.0001,  ...,  0.0129,  0.0171, -0.0015],
        [-0.0142, -0.0120, -0.0104,  ..., -0.0024,  0.0121, -0.0005],
        ...,
        [ 0.0126, -0.0068,  0.0016,  ..., -0.0097,  0.0049,  0.0047],
        [ 0.0025,  0.0100, -0.0010,  ..., -0.0078, -0.0209, -0.0128],
        [-0.0222,  0.0206, -0.0101,  ..., -0.0168,  0.0177,  0.0025]],
       device='cuda:3', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([[ 0.0103, -0.0074,  0.0206,  ...,  0.0034, -0.0028, -0.0205],
        [ 0.0076, -0.0069, -0.0001,  ...,  0.0129,  0.0171, -0.0015],
        [-0.0142, -0.0120, -0.0104,  ..., -0.0024,  0.0121, -0.0005],
        ...,
        [ 0.0126, -0.0068,  0.0016,  ..., -0.0097,  0.0049,  0.0047],
        [ 0.0025,  0.0100, -0.0010,  ..., -0.0078, -0.0209, -0.0128],
        [-0.0222,  0.0206, -0.0101,  ..., -0.0168,  0.0177,  0.0025]],
       device='cuda:2', dtype=torch.bfloat16, requires_grad=True)

After 220 steps, the copies have drifted apart (cuda:0/cuda:1 agree with each other, and cuda:2/cuda:3 agree with each other, but the two pairs differ):

Parameter containing:
tensor([[ 0.0133, -0.0061,  0.0311,  ..., -0.0013,  0.0018, -0.0094],
        [ 0.0045, -0.0047, -0.0002,  ...,  0.0128,  0.0019, -0.0061],
        [-0.0176, -0.0069, -0.0096,  ..., -0.0018, -0.0137, -0.0058],
        ...,
        [ 0.0095, -0.0048,  0.0002,  ..., -0.0094, -0.0166, -0.0002],
        [-0.0006,  0.0121, -0.0013,  ..., -0.0071, -0.0427, -0.0184],
        [-0.0261,  0.0259, -0.0095,  ..., -0.0173, -0.0046, -0.0029]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
Parameter containing:
tensor([[ 0.0133, -0.0061,  0.0311,  ..., -0.0013,  0.0018, -0.0094],
        [ 0.0045, -0.0047, -0.0002,  ...,  0.0128,  0.0019, -0.0061],
        [-0.0176, -0.0069, -0.0096,  ..., -0.0018, -0.0137, -0.0058],
        ...,
        [ 0.0095, -0.0048,  0.0002,  ..., -0.0094, -0.0166, -0.0002],
        [-0.0006,  0.0121, -0.0013,  ..., -0.0071, -0.0427, -0.0184],
        [-0.0261,  0.0259, -0.0095,  ..., -0.0173, -0.0046, -0.0029]],
       device='cuda:1', dtype=torch.bfloat16, requires_grad=True)

Parameter containing:
tensor([[ 1.6235e-02, -7.6904e-03,  3.1128e-02,  ..., -2.7084e-04,
         -1.8768e-03, -1.2207e-02],
        [ 4.0588e-03, -3.8605e-03, -8.6308e-05,  ...,  1.3123e-02,
          1.6022e-03, -6.1340e-03],
        [-1.7944e-02, -5.2795e-03, -8.8501e-03,  ..., -8.0109e-04,
         -1.5259e-02, -5.7678e-03],
        ...,
        [ 9.1553e-03, -3.4790e-03,  6.8283e-04,  ..., -8.3618e-03,
         -1.8677e-02, -1.8501e-04],
        [-1.0529e-03,  1.3611e-02, -8.6975e-04,  ..., -6.1340e-03,
         -4.4434e-02, -1.8311e-02],
        [-2.6611e-02,  2.7344e-02, -9.0332e-03,  ..., -1.6602e-02,
         -5.4932e-03, -2.9449e-03]], device='cuda:2', dtype=torch.bfloat16,
       requires_grad=True)
Parameter containing:
tensor([[ 1.6235e-02, -7.6904e-03,  3.1128e-02,  ..., -2.7084e-04,
         -1.8768e-03, -1.2207e-02],
        [ 4.0588e-03, -3.8605e-03, -8.6308e-05,  ...,  1.3123e-02,
          1.6022e-03, -6.1340e-03],
        [-1.7944e-02, -5.2795e-03, -8.8501e-03,  ..., -8.0109e-04,
         -1.5259e-02, -5.7678e-03],
        ...,
        [ 9.1553e-03, -3.4790e-03,  6.8283e-04,  ..., -8.3618e-03,
         -1.8677e-02, -1.8501e-04],
        [-1.0529e-03,  1.3611e-02, -8.6975e-04,  ..., -6.1340e-03,
         -4.4434e-02, -1.8311e-02],
        [-2.6611e-02,  2.7344e-02, -9.0332e-03,  ..., -1.6602e-02,
         -5.4932e-03, -2.9449e-03]], device='cuda:3', dtype=torch.bfloat16,
       requires_grad=True)

This is only happening when pipeline parallel is enabled.

Looking forward to your reply 😃

deepakn94 commented 7 months ago

Will look into it, thanks.

A few more questions:

mayank31398 commented 7 months ago

@deepakn94 I saw issues regardless of --overlap-grad-reduce. The above tensor logs are without overlap though.

deepakn94 commented 7 months ago

Ack.

Commit hash?

mayank31398 commented 7 months ago

The latest commit; I did a fresh clone a few hours ago :). You can use this one: 6140bf8deb6f944f3bf1b71e5341b9489b42217a.

Also, the issue doesn't happen when data parallelism is off (the 2-GPU case).

deepakn94 commented 7 months ago

Sounds good. Will get back to you as soon as I can.

mayank31398 commented 7 months ago

Hey, did you get a chance to look into this?

deepakn94 commented 7 months ago

Sorry, I was traveling back from India yesterday. I'll get to this next week at best, more likely the following week.

mayank31398 commented 7 months ago

Hey Deepak, following up on this: any progress?

deepakn94 commented 7 months ago

Haven't looked at this in earnest yet. Will look this week.

Have you tried just turning off weight tying? This is what we do by default now, which is partially why this fell through the cracks.
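If I remember the flag name correctly, untying would just mean adding --untie-embeddings-and-output-weights to your GPT_ARGS; please double-check the exact name against arguments.py in your checkout.

    # untie the embedding and output weights (flag name assumed; verify in arguments.py)
    GPT_ARGS="$GPT_ARGS --untie-embeddings-and-output-weights"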

mayank31398 commented 7 months ago

The training is progressing fine; there are no issues with that. It's just that the parameter tying is not working correctly.

deepakn94 commented 7 months ago

I am able to reproduce (including the fact that tying works without data parallelism, but doesn't work with data parallelism). Will work on a fix. Thanks for pointing this out!

deepakn94 commented 7 months ago

It also seems like the mismatch only happens with --use-distributed-optimizer? If I remove this flag, things seem to work as expected. Is this your experience as well, @mayank31398?

mayank31398 commented 7 months ago

Yeah, I think it was working without the distributed optimizer, but I can't say for sure.

mayank31398 commented 6 months ago

Hi @deepakn94 any updates on this one?

deepakn94 commented 6 months ago

Hi @mayank31398, I figured out the issue. The commit that introduced overlapping of the reduce-scatter with the backward pass changed the relative order of the reduce-scatter of gradients across data-parallel replicas and the all-reduce of the embedding gradients across the first and last pipeline stages (we used to do the embedding all-reduce before the data-parallel reduce-scatter). Since the reduce-scatter now runs overlapped with the backward pass, the data-parallel reduce-scatter has to run before the embedding all-reduce, but this leads to issues if the embeddings in the first and last pipeline stages are not partitioned across data-parallel replicas for the distributed optimizer in the same way. Here is a figure that illustrates the issue:

[figure: the embedding gradient buffer partitioned differently across data-parallel ranks in the first and last pipeline stages]

I have a fix: assign the embedding in each pipeline stage to its own separate bucket, which ensures that the partitioning strategy is the same across both pipeline stages. I'm still working on cleaning it up, and then it will need to go through our internal review process before it makes its way to GitHub. Let me know if any of this doesn't make sense!
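To make the failure mode concrete, here is a toy single-process simulation of the ordering described above (illustrative code, not Megatron internals): each stage reduce-scatters its gradient buffer across DP ranks, the two stages then all-reduce their buffers, and each rank applies the update only to the shard it owns. With matching shard boundaries the two copies stay identical; with different bucketing they drift.

    import torch

    torch.manual_seed(0)
    V, DP, lr = 8, 2, 0.1
    weight = torch.randn(V)                  # the tied embedding / LM-head parameter
    grads = {"first": torch.randn(DP, V),    # per-DP-replica grads on the first stage
             "last":  torch.randn(DP, V)}    # per-DP-replica grads on the last stage

    def reduce_scatter(g, shards):
        """After the DP reduce-scatter, each rank's buffer holds the DP-summed grad
        only inside the shard it owns; the rest of the buffer keeps stale local values."""
        bufs = g.clone()
        for rank, (lo, hi) in enumerate(shards):
            bufs[rank, lo:hi] = g[:, lo:hi].sum(dim=0)
        return bufs

    def step(shards_first, shards_last):
        """DP reduce-scatter per stage, then the embedding all-reduce between the two
        stages' buffers (per DP rank), then each rank updates only the shard it owns."""
        bufs = {"first": reduce_scatter(grads["first"], shards_first),
                "last":  reduce_scatter(grads["last"],  shards_last)}
        summed = bufs["first"] + bufs["last"]   # the first/last-stage embedding all-reduce
        updated = {}
        for stage, shards in (("first", shards_first), ("last", shards_last)):
            w = weight.clone()
            for rank, (lo, hi) in enumerate(shards):
                w[lo:hi] -= lr * summed[rank, lo:hi]   # distributed-optimizer-style update
            updated[stage] = w                         # shards are all-gathered in practice
        return updated

    even, skewed = [(0, 4), (4, 8)], [(0, 5), (5, 8)]
    ok, bad = step(even, even), step(even, skewed)
    print("same partitioning on both stages   -> still tied:", torch.allclose(ok["first"], ok["last"]))
    print("different partitioning (bucketing) -> they drift:", torch.allclose(bad["first"], bad["last"]))

The fix described above (giving the embedding its own bucket in every stage) effectively forces the "same partitioning" case.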

mayank31398 commented 6 months ago

Yup, I had the same understanding. I think the explanation makes sense. Thanks a lot for this. I will wait for the fix :)

deepakn94 commented 6 months ago

Fix is here: https://github.com/NVIDIA/Megatron-LM/commit/db2040f7ebdda99c18125936376fe30119267e6b. Please let me know if it works for you too. Thanks, @mayank31398!

mayank31398 commented 6 months ago

Hey, thanks I will take a look over this weekend.

deepakn94 commented 6 months ago

Hi @mayank31398, are you still seeing this issue? If not, can I close this?

mayank31398 commented 5 months ago

Hey @deepakn94, I am really busy right now, so I haven't gotten a chance to try this. You can close this for now; I will test it sometime soon and re-open if there are issues.

deepakn94 commented 5 months ago

Sounds good, let me know.