0hq / WebGPT

Run GPT models in the browser with WebGPU. An implementation of GPT inference in under ~1500 lines of vanilla JavaScript.

De-embeddings on GPU + major temperature/top_k bug fix! #12

Closed 0hq closed 1 year ago

0hq commented 1 year ago

I was previously running de-embeddings on the CPU because of the large size of the multiply against a matrix of shape (n_embd, vocab_size): the buffer required for this matrix exceeded maxStorageBufferBindingSize. For example, on a 2020 M1 Mac this limit is around 134 million bytes (128 MiB), while the de-embedding matrix of 768 × 50304 32-bit floats comes to around 155 million bytes.
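The size check above can be sketched as a small helper (function names here are hypothetical, not the ones in main.js; this assumes f32 weights and the common 128 MiB default limit, which in a real app would come from `adapter.limits.maxStorageBufferBindingSize`):

```javascript
// Decide whether a weight matrix fits in a single storage buffer binding,
// given the device limit in bytes. (Hypothetical helper for illustration.)
const BYTES_PER_FLOAT = 4; // f32 weights

function fitsInOneBinding(rows, cols, maxStorageBufferBindingSize) {
  return rows * cols * BYTES_PER_FLOAT <= maxStorageBufferBindingSize;
}

// De-embedding matrix: n_embd = 768, padded vocab = 50304.
// With a 128 MiB (134,217,728 byte) limit, the ~155 MB of weights
// does not fit in one binding, so the multiply must be chunked.
const limit = 134217728;
fitsInOneBinding(768, 50304, limit); // → false, chunking required
```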

I now split this matrix (and soon others as well) when it exceeds maxStorageBufferBindingSize. Currently this is done by computing the lowest prime factor of vocab_size and chunking the calculation along the column dimension. More research is needed on the most efficient way to split matrix calculations that exceed storage limits; see the comment in runGPT() in main.js.
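A minimal sketch of the column-chunking idea described above (helper names are hypothetical, not the ones in main.js): peel off the lowest prime factor of the column count, repeatedly if necessary, until each chunk fits under the binding limit.

```javascript
// Smallest prime factor of n by trial division.
function lowestPrimeFactor(n) {
  for (let p = 2; p * p <= n; p++) {
    if (n % p === 0) return p;
  }
  return n; // n itself is prime
}

// Split the column dimension into equal chunks that each fit in maxBytes.
// (Illustrative sketch; assumes f32 elements by default.)
function columnChunks(rows, cols, maxBytes, bytesPerEl = 4) {
  let chunkCols = cols;
  while (rows * chunkCols * bytesPerEl > maxBytes) {
    const p = lowestPrimeFactor(chunkCols);
    if (p === chunkCols) break; // prime width: cannot split evenly further
    chunkCols /= p;
  }
  return { chunkCols, numChunks: cols / chunkCols };
}

columnChunks(768, 50304, 134217728); // → 2 chunks of 25152 columns each
```

Since 50304 is even, the first division by its lowest prime factor (2) already brings each chunk under the 128 MiB limit in this case.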

There was also a major bug: the generate() parameters from index.html were being passed improperly, so the top_k parameter was ignored and the temperature was always set to 10. This fixes a bunch of weird behavior.
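These two parameters directly shape the sampling distribution, which is why the bug produced such strange output. A minimal sketch of how temperature and top_k are typically applied to raw logits (illustrative only, not the repo's exact code; the injectable `rand` parameter is just for testability):

```javascript
// Sample a token index from logits using top-k filtering and temperature.
function sampleTopK(logits, topK, temperature, rand = Math.random) {
  // Scale by temperature and keep only the topK highest logits.
  const indexed = logits.map((l, i) => [l / temperature, i]);
  indexed.sort((a, b) => b[0] - a[0]);
  const kept = indexed.slice(0, topK);
  // Softmax over the kept logits (subtract max for numerical stability).
  const maxLogit = kept[0][0];
  const exps = kept.map(([l]) => Math.exp(l - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  // Draw from the resulting categorical distribution.
  let r = rand() * total;
  for (let i = 0; i < kept.length; i++) {
    r -= exps[i];
    if (r <= 0) return kept[i][1];
  }
  return kept[kept.length - 1][1];
}
```

With a huge temperature like the buggy 10, the scaled logits flatten toward a uniform distribution over the top-k tokens, and with top_k ignored, sampling drifts over the whole vocabulary.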

vercel[bot] commented 1 year ago

The latest updates on your projects.

| Name | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| web-gpt | ✅ Ready | Visit Preview | Apr 22, 2023 0:58am |