justheuristic closed this issue 2 years ago
Naively quantized bloom 6b on CPU: http://nora:8800/notebooks/decentralized/jheuristic/bloom_test/cpu-qint-matmul.ipynb source: https://gist.github.com/justheuristic/47830c9ddfd45889894e69d4f45ce233
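The notebook and gist aren't reproduced here; as a rough illustration, "naively quantized" matmul usually means something like symmetric per-tensor int8 quantization with int32 accumulation. A minimal numpy sketch (the 127 clip range and per-tensor scale are assumptions, not taken from the gist):

```python
import numpy as np

def quantize(x):
    # naive symmetric per-tensor int8 quantization:
    # map [-max|x|, +max|x|] onto [-127, 127]
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def qint_matmul(a, b):
    # quantize both operands, accumulate in int32, rescale to float
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)
max_err = np.abs(qint_matmul(x, w) - x @ w).max()
```

A production path would instead use `torch.quantization` / fbgemm kernels with per-channel scales, but the error behavior is the same idea.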
Since we plan to run embeddings/logits on the client side, we need to compute them efficiently. The embedding lookup is cheap, but the logits are more complicated: computing the final logits on a Colab CPU takes over a minute per token.
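For scale, a back-of-envelope estimate of the logits projection (assuming BLOOM's 250880-token vocabulary and a 4096-dim hidden state; the exact hidden size for the ~6B checkpoint is an assumption here):

```python
# assumed shapes: vocab from BLOOM's tokenizer, hidden size assumed 4096
hidden_size = 4096
vocab_size = 250880

# one token of logits = hidden @ W_emb.T -> one multiply-add
# per (hidden, vocab) pair
flops_per_token = 2 * hidden_size * vocab_size

# fp32 output-embedding matrix the client must hold for this projection
weight_bytes = hidden_size * vocab_size * 4

print(f"{flops_per_token / 1e9:.1f} GFLOPs per token")
print(f"{weight_bytes / 2**30:.1f} GiB of fp32 logit weights")
```

~2 GFLOPs per token is small in absolute terms, so a minute per token points at an unoptimized (or memory-bound, ~4 GiB weight traffic) CPU path rather than a fundamental limit.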
Solution 1: use fast KNN
Solution 2: just use a GPU
Current opinion: use GPUs, think about fast CPU mode later.
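On the fast-KNN option: for greedy/top-k decoding you only need the highest-scoring tokens, so the full matmul can be replaced by maximum-inner-product search over the output-embedding rows. A toy sketch of the interface (exact brute-force top-k on made-up shapes; a real version would swap in an ANN index such as faiss `IndexFlatIP` or an IVF/HNSW variant):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 1000, 64   # toy shapes, not BLOOM's real ones
emb = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)

def topk_logits(hidden, k=10):
    # exact top-k via full matmul: this O(vocab * hidden) step is what
    # an approximate MIPS/KNN index would replace
    scores = emb @ hidden
    idx = np.argpartition(scores, -k)[-k:]
    idx = idx[np.argsort(-scores[idx])]
    return idx, scores[idx]

hidden = rng.standard_normal(hidden_size).astype(np.float32)
ids, scores = topk_logits(hidden, k=5)
```

The caveat is that approximate search returns approximate top-k, which matters for sampling with temperature over the full distribution.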
Copied the BLOOM implementation from huggingface/transformers, commit https://github.com/huggingface/transformers/commit/ca2a55e9dfb245527b5e1c954fec6ffbb7aef07b
Their attention code is spectacularly bad, see #1
Next steps (as of https://github.com/learning-at-home/bloom-demo/commit/82214699f26e098ba295c5c84cf119dd7a6bd1a7):
Why: so we can play with it in inference mode
Original bloom code: https://github.com/huggingface/transformers/pull/17474
The quest is to