HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training

Compact Loss #79

Open ClashLuke opened 1 year ago

ClashLuke commented 1 year ago

Our model spends a lot of parameters on its embedding and output layers. Specifically, it uses 2 * vocab_size * devices * features parameters, where features=256 and devices=256 for the planned 20B model, so with the GPT-2 tokenizer the two embedding matrices alone account for roughly 4.2B + 4.2B parameters.

For comparison, ALBERT used factorized embeddings, reducing the parameter count from 256*256*vocab = 8.59B to 256*256*sqrt(vocab)*2 = 33.5M.
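
Roughly, the factorization replaces each full vocab x (devices*features) table with a vocab x sqrt(vocab) lookup followed by a sqrt(vocab) x (devices*features) projection. Below is a minimal sketch of that idea in plain JAX; it is not code from this repo, and all names and toy sizes are made up for illustration. The large matrices are only counted, never materialized:

```python
import jax
import jax.numpy as jnp

vocab = 65536        # GPT-2-sized vocabulary, rounded to 2**16 for the arithmetic
d_model = 256 * 256  # devices * features for the planned model
d_factor = 256       # sqrt(vocab)

# Parameter counts for one embedding matrix (input or output):
full = vocab * d_model                               # ~4.29B
factorized = vocab * d_factor + d_factor * d_model   # ~33.5M
print(f"full: {full:,}  factorized: {factorized:,}")

# Toy-sized factorized lookup so the sketch actually runs:
key = jax.random.PRNGKey(0)
k_low, k_proj = jax.random.split(key)
low = jax.random.normal(k_low, (1024, 32)) * 0.02    # toy vocab x d_factor
proj = jax.random.normal(k_proj, (32, 512)) * 0.02   # toy d_factor x d_model

def embed(token_ids):
    # Look up in the small table, then project up to the model width.
    return low[token_ids] @ proj

tokens = jnp.array([1, 2, 3, 4])
print(embed(tokens).shape)  # (4, 512)
```

With the sizes above, the factorized count works out to 65536*256 + 256*65536 = 33.5M per matrix, matching the ALBERT-style number quoted in the comment.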