CERC-AAI / multimodal

An implementation of model parallel autoregressive transformers on GPUs, based on the DeepSpeed library.
Apache License 2.0

Training without Pipeline Parallelism #5

Closed kshitijkg closed 1 year ago

kshitijkg commented 1 year ago

When training without pipeline parallelism, the sequential wrapper is used: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/training.py#L461. The code for to_sequential is here: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/gpt2_model.py#L343

However, all of the added adapters are lost when this is done.

This is probably because the model is rebuilt from self.specs, which isn't updated when the adapters are added.
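For illustration, here is a minimal, self-contained sketch of the suspected failure mode (this is not the gpt-neox code; the `AdapterWrapper` below is a simplified stand-in): wrappers attached to already-built module instances are lost when the model is re-instantiated from the original specs.

```python
import torch.nn as nn

class AdapterWrapper(nn.Module):
    """Simplified stand-in for the magma adapter wrapper (not the real class)."""
    def __init__(self, block, dim=512, bottleneck=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim)
        )
        self.attn_block = block

    def forward(self, x):
        # residual adapter on top of the wrapped block
        return self.attn_block(x) + self.adapter(x)

# self.specs holds layer *constructors*, not instances
specs = [lambda: nn.Linear(512, 512)]

# building the pipeline instantiates the specs; adapters are then attached
# to the resulting instances
layers = [spec() for spec in specs]
layers[0] = AdapterWrapper(layers[0])

# rebuilding from the original specs (what to_sequential effectively does)
# re-instantiates the layers, so the wrapper is gone
rebuilt = nn.Sequential(*[spec() for spec in specs])
print(any(isinstance(m, AdapterWrapper) for m in rebuilt.modules()))  # -> False
```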

floatingbigcat commented 1 year ago

Hi, I have tested on a small model with pp=1 and mp=1, and the output of the model looks fine. Did you change this line? Maybe our code doesn't make the model sequential now: https://github.com/floatingsnake/gpt-neox/blob/73cdd8692be8a2c579444434e60a01450a8c9a3c/megatron/neox_arguments/arguments.py#L992

https://github.com/floatingsnake/gpt-neox/blob/magma/mytests/test_model_build.py https://github.com/floatingsnake/gpt-neox/blob/magma/configs/summit-70m-openclipH.yml#L16-L17

Part of the output:

    )
    (6): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (7): ParallelTransformerLayerPipe(
      (input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attention): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelSelfAttention(
          (query_key_value): ColumnParallelLinear()
          (rotary_emb): RotaryEmbedding()
          (scale_mask_softmax): FusedScaleMaskSoftmax()
          (attention_dropout): Dropout(p=0, inplace=False)
          (dense): RowParallelLinear()
        )
      )
      (post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (mlp): AdapterWrapper(
        (adapter): Sequential(
          (0): Linear(in_features=512, out_features=64, bias=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=512, bias=True)
        )
        (attn_block): ParallelMLP(
          (dense_h_to_4h): ColumnParallelLinear()
          (dense_4h_to_h): RowParallelLinear()
        )
      )
    )
    (9): NormPipe(
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (10): ParallelLinearPipe(
      (final_linear): ColumnParallelLinear()
    )
  )
)
Current GPU memory usage: 9.01 GB
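
A quick way to verify that the adapters survive a `to_sequential` conversion is to count the `AdapterWrapper` modules before and after. A rough sketch, reusing the names from this thread (`model`, `to_sequential`) rather than the exact code in mytests/test_model_build.py:

```python
# Sanity-check sketch: `model` is assumed to be the built pipeline model from
# the test script; this is not the exact code in mytests/test_model_build.py.
def count_adapters(module):
    """Count AdapterWrapper instances anywhere in the module tree."""
    return sum(1 for m in module.modules() if type(m).__name__ == "AdapterWrapper")

n_pipe = count_adapters(model)                  # adapters present in the pipeline model
n_seq = count_adapters(model.to_sequential())   # should match if nothing is dropped
assert n_seq == n_pipe, f"adapters lost in to_sequential: {n_pipe} -> {n_seq}"
```
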
floatingbigcat commented 1 year ago

Since we have abandoned the sequential wrapper, and mp=1, pp=1 works well without it, I'm closing this. We can reopen the issue when it is needed.

kshitijkg commented 1 year ago

Yes, I had changed that line to test the sequential wrapper. But yeah, solving this is not a high priority for now since we are moving away from the sequential wrapper :)