In the paper, we can read:

"In the first stage, a smaller decoder trunk consisting of 8 Transformer blocks with width 1024, rotary positional embeddings, and MLPs is trained to only predict backbone coordinates. In the second stage, the decoder weights are re-initialized and the network size is expanded to 30 layers, each with an embedding dimension of 1280 (∼600M parameters) to predict all atom coordinates."

The embedding dimension or width must be divisible by the number of heads, and the number of heads is 20 for the large decoder with an embedding dimension of 1280:
https://github.com/evolutionaryscale/esm/blob/95e3c5be8acda407414810ff3aa7d27dbb6e30d3/esm/pretrained.py#L55
What is the number of heads for the small decoder (width 1024)? 16 heads?
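For what it's worth, here is a minimal arithmetic sketch of why 16 seems like the natural guess, assuming the small decoder keeps the same 64-dimensional heads as the large one (1280 / 20 = 64). The head_dim helper below is only for illustration, not something from the repo:

```python
# Sketch of the divisibility constraint and per-head dimension arithmetic.
# Assumption (not confirmed by the linked code): the small decoder keeps the
# same 64-dimensional heads as the large decoder.

def head_dim(embed_dim: int, n_heads: int) -> int:
    """Per-head dimension; the width must split evenly across heads."""
    assert embed_dim % n_heads == 0, "width must be divisible by the number of heads"
    return embed_dim // n_heads

print(head_dim(1280, 20))  # large decoder: 1280 / 20 = 64
print(1024 // 64)          # if the head dimension stays 64, width 1024 would imply 16 heads
```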