In the paper, we can read:

"In the first stage, a smaller decoder trunk consisting of 8 Transformer blocks with width 1024, rotary positional embeddings, and MLPs is trained to only predict backbone coordinates. In the second stage, the decoder weights are re-initialized and the network size is expanded to 30 layers, each with an embedding dimension of 1280 (∼600M parameters) to predict all atom coordinates."

The embedding dimension or width must be divisible by the number of heads, and the number of heads is 20 for the large decoder with an embedding dimension of 1280:
https://github.com/evolutionaryscale/esm/blob/95e3c5be8acda407414810ff3aa7d27dbb6e30d3/esm/pretrained.py#L55
What is the number of heads for the small decoder (width 1024)? 16 heads?
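For what it's worth, here is a minimal arithmetic sketch of why 16 seems like the natural guess, assuming the small decoder keeps the same 64-dimensional heads as the large one (1280 / 20 = 64). The head_dim helper below is only for illustration, not something from the repo:

```python
# Sketch of the divisibility constraint and per-head dimension arithmetic.
# Assumption (not confirmed by the linked code): the small decoder keeps the
# same 64-dimensional heads as the large decoder.

def head_dim(embed_dim: int, n_heads: int) -> int:
    """Per-head dimension; the width must split evenly across heads."""
    assert embed_dim % n_heads == 0, "width must be divisible by the number of heads"
    return embed_dim // n_heads

print(head_dim(1280, 20))  # large decoder: 1280 / 20 = 64
print(1024 // 64)          # if the head dimension stays 64, width 1024 would imply 16 heads
```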