Closed: kshitijkg closed this issue 1 year ago.
Hi, I have tested on a small model with pp=1 mp=1, and the output of the model looks fine. Did you change this? Maybe our code doesn't make the model sequential anymore: https://github.com/floatingsnake/gpt-neox/blob/73cdd8692be8a2c579444434e60a01450a8c9a3c/megatron/neox_arguments/arguments.py#L992
Test script: https://github.com/floatingsnake/gpt-neox/blob/magma/mytests/test_model_build.py
Config: https://github.com/floatingsnake/gpt-neox/blob/magma/configs/summit-70m-openclipH.yml#L16-L17
Part of the output:
)
(6): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(attention): AdapterWrapper(
(adapter): Sequential(
(0): Linear(in_features=512, out_features=64, bias=True)
(1): ReLU()
(2): Linear(in_features=64, out_features=512, bias=True)
)
(attn_block): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0, inplace=False)
(dense): RowParallelLinear()
)
)
(post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): AdapterWrapper(
(adapter): Sequential(
(0): Linear(in_features=512, out_features=64, bias=True)
(1): ReLU()
(2): Linear(in_features=64, out_features=512, bias=True)
)
(attn_block): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
(7): ParallelTransformerLayerPipe(
(input_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(attention): AdapterWrapper(
(adapter): Sequential(
(0): Linear(in_features=512, out_features=64, bias=True)
(1): ReLU()
(2): Linear(in_features=64, out_features=512, bias=True)
)
(attn_block): ParallelSelfAttention(
(query_key_value): ColumnParallelLinear()
(rotary_emb): RotaryEmbedding()
(scale_mask_softmax): FusedScaleMaskSoftmax()
(attention_dropout): Dropout(p=0, inplace=False)
(dense): RowParallelLinear()
)
)
(post_attention_layernorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(mlp): AdapterWrapper(
(adapter): Sequential(
(0): Linear(in_features=512, out_features=64, bias=True)
(1): ReLU()
(2): Linear(in_features=64, out_features=512, bias=True)
)
(attn_block): ParallelMLP(
(dense_h_to_4h): ColumnParallelLinear()
(dense_4h_to_h): RowParallelLinear()
)
)
)
(9): NormPipe(
(norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(10): ParallelLinearPipe(
(final_linear): ColumnParallelLinear()
)
)
)
Current GPU memory usage: 9.01 GB
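From the printout, each attention and MLP block is wrapped by an AdapterWrapper holding a bottleneck adapter (512 → 64 → 512). A minimal sketch of what such a wrapper could look like, assuming a residual adapter applied to the wrapped block's output (the actual magma-branch AdapterWrapper implementation may differ):

```python
import torch.nn as nn

class AdapterWrapper(nn.Module):
    """Hypothetical sketch of the wrapper seen in the printout: a residual
    bottleneck adapter (hidden -> downsample -> hidden) applied to the output
    of the wrapped attention or MLP block."""

    def __init__(self, attn_block: nn.Module, hidden_size: int = 512, downsample: int = 64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(hidden_size, downsample),
            nn.ReLU(),
            nn.Linear(downsample, hidden_size),
        )
        self.attn_block = attn_block

    def forward(self, *args, **kwargs):
        out = self.attn_block(*args, **kwargs)
        # ParallelSelfAttention / ParallelMLP return a (hidden, bias) tuple;
        # handle both tuple and tensor outputs.
        if isinstance(out, tuple):
            hidden, *rest = out
            return (hidden + self.adapter(hidden), *rest)
        return out + self.adapter(out)
```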
As we abandoned the sequential wrapper, and mp=1, pp=1 works well without it, we can reopen the issue when it is needed.
Yes, I had changed that line to test the sequential wrapper. But yeah, solving this is not a high priority for now since we are moving away from the sequential wrapper :)
When training without pipeline parallelism, the sequential wrapper is used: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/training.py#L461. Code for to_sequential: https://github.com/floatingsnake/gpt-neox/blob/magma/megatron/model/gpt2_model.py#L343
However, all the adapters that were added are lost when this is done.
This is probably because the model is rebuilt from self.specs, which was not updated when the adapters were added.
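A minimal, self-contained sketch of that failure mode (ToyPipeModel and wrap_with_adapter are illustrative stand-ins, not the actual GPT2ModelPipe / AdapterWrapper code):

```python
import torch.nn as nn

# Toy reproduction: the pipeline model keeps a list of layer specs
# (constructors) and to_sequential() re-instantiates every layer from them,
# so wrapping the already-built layers is lost after the rebuild.

class ToyPipeModel:
    def __init__(self, specs):
        self.specs = specs                                   # layer constructors
        self.layers = nn.ModuleList(spec() for spec in specs)

    def to_sequential(self):
        # Mirrors gpt2_model.py's to_sequential: rebuild everything from self.specs.
        return nn.Sequential(*[spec() for spec in self.specs])

def wrap_with_adapter(block, hidden=512, downsample=64):
    # Stand-in for AdapterWrapper: bottleneck adapter stacked on the wrapped block.
    return nn.Sequential(block, nn.Linear(hidden, downsample), nn.ReLU(),
                         nn.Linear(downsample, hidden))

model = ToyPipeModel([lambda: nn.Linear(512, 512) for _ in range(2)])
model.layers = nn.ModuleList(wrap_with_adapter(l) for l in model.layers)  # adapters added in place
seq = model.to_sequential()

# The rebuilt sequential model contains only the plain layers: the adapters
# are gone because self.specs was never updated.
print([type(m).__name__ for m in seq])   # ['Linear', 'Linear']
```

So a fix would presumably be either to update self.specs at the point where the adapters are inserted, or to re-apply the adapter wrapping to the nn.Sequential returned by to_sequential().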