Training crashes when "(hidden_size * num_kv_heads) / (num_attention_heads * num_attention_heads)" is not an integer.

EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

https://www.eleuther.ai/

Apache License 2.0

Closed: tiandeyu-cs closed this issue 1 week ago

tiandeyu-cs commented 3 weeks ago

Describe the bug: Training crashes when "(hidden_size * num_kv_heads) / (num_attention_heads * num_attention_heads)" is not an integer.

To Reproduce: Train a model with the following configuration:

{
    "hidden_size": 5120,
    "num_attention_heads": 40,
    "num_kv_heads": 8,
    "seq_length": 4096,
    "batch_size": 1
}

It will crash with the following error:

RuntimeError: shape '[4096, 1, 5, 179]' is invalid for input of size 3670016
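
The numbers in the error message can be reproduced with plain arithmetic. The sketch below rests on two assumptions that the report does not state: that the fused QKV projection is viewed with a per-head width of head_dim * (1 + 2 * num_kv_heads / num_attention_heads), truncated to an integer, and that training used 8-way model parallelism (suggested by the 5 in the reported shape). The variable names are illustrative and not taken from the gpt-neox source.

```python
# Values from the reported configuration.
hidden_size = 5120
num_attention_heads = 40
num_kv_heads = 8
seq_length = 4096
batch_size = 1

# ASSUMPTION: 8-way model parallelism, inferred from the "5" in the error shape.
model_parallel_size = 8

head_dim = hidden_size // num_attention_heads                      # 128
heads_per_partition = num_attention_heads // model_parallel_size   # 5
kv_heads_per_partition = num_kv_heads // model_parallel_size       # 1

# Elements per token in the fused QKV output on one partition:
# one query block per attention head, plus one key and one value block per KV head.
qkv_width = head_dim * (heads_per_partition + 2 * kv_heads_per_partition)

# ASSUMPTION: the per-head width used for the view is computed in floating
# point and then truncated, which drops the fractional part.
per_head_width = int(
    head_dim * (1 + 2 * kv_heads_per_partition / heads_per_partition)
)

input_size = seq_length * batch_size * qkv_width
view_size = seq_length * batch_size * heads_per_partition * per_head_width

# The expression from the issue title, checked with integer arithmetic.
title_remainder = (hidden_size * num_kv_heads) % (num_attention_heads ** 2)

print("input size:", input_size)
print("view shape:", [seq_length, batch_size, heads_per_partition, per_head_width])
print("view size:", view_size)
print("title expression is an integer:", title_remainder == 0)
```

Under these assumptions the requested view is 4096 elements smaller than the tensor (one element per token), which is consistent with the sizes in the RuntimeError above.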