An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Training crashes when "(hidden_size * num_kv_heads) / (num_attention_heads * num_attention_heads)" is not an integer. #1314
Closed
tiandeyu-cs closed 1 week ago
Describe the bug: Training crashes when "(hidden_size * num_kv_heads) / (num_attention_heads * num_attention_heads)" is not an integer.
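The condition can be checked before launching a run. The snippet below is a minimal sketch of the arithmetic stated in this report only; it is not code from the library. The function name `kv_ratio_is_integer` and the example values are hypothetical.

```python
def kv_ratio_is_integer(hidden_size: int, num_attention_heads: int, num_kv_heads: int) -> bool:
    """Return True when (hidden_size * num_kv_heads) / (num_attention_heads ** 2) is a whole number.

    This mirrors the expression quoted in the issue title; where the library
    actually evaluates it is not shown here.
    """
    numerator = hidden_size * num_kv_heads
    denominator = num_attention_heads * num_attention_heads
    return numerator % denominator == 0


# Hypothetical values, chosen only to illustrate both outcomes.
print(kv_ratio_is_integer(hidden_size=4096, num_attention_heads=32, num_kv_heads=8))
print(kv_ratio_is_integer(hidden_size=768, num_attention_heads=12, num_kv_heads=4))
```

With the first set of values the quotient is 32768 / 1024 = 32, so the check passes; with the second it is 3072 / 144, which is not an integer, so a configuration like it would fall into the case described in this report.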
To Reproduce Train a model with configuration as follows:
It will crash with the following error: