I was previously running de-embeddings on the CPU because of the large size of multiplying by a matrix of size (n_embd, vocab_size): the buffer for this matrix exceeded maxStorageBufferBindingSize. For example, on a 2020 M1 Mac this limit is around 134 million bytes (128 MiB), while GPT-2 medium's de-embedding matrix of 768 * 50304 is around 38.6 million elements, or roughly 155 MB as 32-bit floats.
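As a sanity check on the arithmetic, here is a minimal sketch of the fit test. The 128 MiB default and the helper name `fitsInOneBinding` are illustrative, not the project's code; the real value should be read from `GPUDevice.limits.maxStorageBufferBindingSize` at runtime.

```javascript
// Assumed default WebGPU limit of 128 MiB; the actual value comes from
// GPUDevice.limits.maxStorageBufferBindingSize on the user's device.
const MAX_STORAGE_BUFFER_BINDING_SIZE = 128 * 1024 * 1024; // 134,217,728 bytes

// Would a rows x cols matrix of 32-bit floats fit in one storage buffer binding?
function fitsInOneBinding(rows, cols, bytesPerElement = 4) {
  return rows * cols * bytesPerElement <= MAX_STORAGE_BUFFER_BINDING_SIZE;
}
```

For GPT-2 medium's de-embedding matrix, `fitsInOneBinding(768, 50304)` is false (≈155 MB), while half the columns, `fitsInOneBinding(768, 25152)`, fits (≈77 MB), which is why splitting works.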
I now split this matrix (soon to be others as well) when it exceeds maxStorageBufferBindingSize. Right now, this is done by computing the lowest prime factor of vocab_size and chunking the calculation across the column dimension. More research is needed on the most efficient way to split matrix calculations that exceed storage limits; see the comment in runGPT() in main.js.
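The chunking strategy can be sketched as follows. This is a plain-JS CPU reference of the idea, not the project's GPU code; the function names are illustrative. Because the chunk count is a factor of vocab_size, the column chunks divide evenly.

```javascript
// Smallest prime factor of n (n itself if n is prime).
function lowestPrimeFactor(n) {
  for (let p = 2; p * p <= n; p++) if (n % p === 0) return p;
  return n;
}

// Naive reference matmul of A (m x k) with a column slice of B (k x n),
// covering columns [colStart, colStart + colCount). Flat row-major arrays.
function matmulColumnChunk(A, B, m, k, n, colStart, colCount) {
  const out = new Float32Array(m * colCount);
  for (let i = 0; i < m; i++) {
    for (let c = 0; c < colCount; c++) {
      let sum = 0;
      for (let j = 0; j < k; j++) sum += A[i * k + j] * B[j * n + colStart + c];
      out[i * colCount + c] = sum;
    }
  }
  return out;
}

// Split A x B into lowestPrimeFactor(n) column chunks, so each chunk's
// buffer stays under the binding limit, then stitch the results together.
function chunkedMatmul(A, B, m, k, n) {
  const chunks = lowestPrimeFactor(n);
  const chunkCols = n / chunks; // integer: chunks divides n
  const out = new Float32Array(m * n);
  for (let c = 0; c < chunks; c++) {
    const part = matmulColumnChunk(A, B, m, k, n, c * chunkCols, chunkCols);
    for (let i = 0; i < m; i++) {
      out.set(part.subarray(i * chunkCols, (i + 1) * chunkCols), i * n + c * chunkCols);
    }
  }
  return out;
}
```

For vocab_size = 50304, the lowest prime factor is 2, so the de-embedding matmul splits into two column halves of 25152 each. One open question (per the comment in runGPT()) is whether a fixed chunk-byte budget would beat the lowest-prime-factor split, since a large prime vocab_size would defeat this scheme.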
There was also a major issue with the generate() parameters from index.html being improperly passed, which caused the top_k param to be ignored and the temperature to always be set to 10. Fixing this resolves a bunch of weird sampling behavior.
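To see why those two parameters matter so much, here is a minimal sketch of top-k sampling with temperature. It is illustrative only, not the project's generate() implementation, and `sampleTopK` is a hypothetical name.

```javascript
// Sample one index from logits using temperature scaling and top-k filtering.
function sampleTopK(logits, topK, temperature) {
  // Divide logits by temperature first. A temperature of 10 flattens the
  // distribution toward uniform, which is why the stuck value of 10
  // produced near-random output.
  const scaled = logits.map((l) => l / temperature);
  // Keep only the topK highest-scoring token indices.
  const indices = scaled
    .map((_, i) => i)
    .sort((a, b) => scaled[b] - scaled[a])
    .slice(0, topK);
  // Softmax over the kept logits (shift by the max for numerical stability).
  const maxL = Math.max(...indices.map((i) => scaled[i]));
  const exps = indices.map((i) => Math.exp(scaled[i] - maxL));
  const total = exps.reduce((a, b) => a + b, 0);
  // Draw one kept index in proportion to its probability mass.
  let r = Math.random() * total;
  for (let j = 0; j < indices.length; j++) {
    r -= exps[j];
    if (r <= 0) return indices[j];
  }
  return indices[indices.length - 1];
}
```

With the bug, top_k never pruned the tail and temperature = 10 washed out the logit differences; with the fix, both behave as the index.html controls intend.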