Our model uses a lot of parameters for the embedding and output layers. Specifically, `2 * vocab_size * devices * features`, where `features=256` and `devices=256` for the planned 20B model, implying that it would use 4.2B + 4.2B parameters purely for the embedding matrices with the GPT-2 tokenizer.
For example, ALBERT used factorized embeddings, reducing the number of parameters from `2 * 256*256*vocab = 8.59B` to `256*256*sqrt(vocab)*2 = 33.5M`.
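To make the arithmetic above concrete, here is a minimal sketch of both counts. It assumes a vocabulary padded to 65536 (the power of two that makes the 8.59B figure in the text come out exactly; the raw GPT-2 tokenizer has 50257 tokens) and a model dimension of `devices * features = 65536`:

```python
# Parameter counts for the embedding matrices described above.
# Assumption: vocab_size padded to 65536, consistent with the 8.59B figure.
vocab_size = 65536
devices = 256
features = 256
model_dim = devices * features  # 65536

# Untied input + output embeddings: 2 * vocab_size * model_dim parameters.
full = 2 * vocab_size * model_dim
print(f"full embeddings: {full / 1e9:.2f}B")

# ALBERT-style factorization: map vocab -> bottleneck -> model_dim
# with a bottleneck of size sqrt(vocab_size), instead of vocab -> model_dim.
bottleneck = int(vocab_size ** 0.5)  # 256
factorized = vocab_size * bottleneck + bottleneck * model_dim
print(f"factorized embeddings: {factorized / 1e6:.2f}M")
```

Since `vocab_size` equals `model_dim` here, both factor matrices have the same size, which is why the text writes the factorized count as `256*256*sqrt(vocab)*2`.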