bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

[CODE] port HF bloom to hivemind #2

Closed · justheuristic closed this issue 2 years ago

justheuristic commented 2 years ago

Why: so we can play with it in inference mode.

Original bloom code: https://github.com/huggingface/transformers/pull/17474

The quest is to

justheuristic commented 2 years ago

Naively quantized BLOOM-6B on CPU: http://nora:8800/notebooks/decentralized/jheuristic/bloom_test/cpu-qint-matmul.ipynb (source: https://gist.github.com/justheuristic/47830c9ddfd45889894e69d4f45ce233)
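For reference, here is a minimal sketch of what "naive quantization" looks like with PyTorch's built-in dynamic int8 quantization of linear layers; the checkpoint name and the exact scheme are assumptions, not necessarily what the notebook above does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; substitute the actual 6B checkpoint.
MODEL_NAME = "bigscience/bloom-6b3"

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Naive dynamic quantization: nn.Linear weights are stored as int8,
# activations are quantized on the fly during each matmul on CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```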

justheuristic commented 2 years ago

On client-side computations

Since we plan to run embeddings/logits on the client side, we need to compute them efficiently. Embedding lookup is cheap AF, but logits are more complicated: they require a matmul over the full vocabulary.

Computing the final logits on a Colab CPU takes over a minute per token (see attached screenshot).

Solution 1: use fast KNN

Solution 2: just use a GPU

Current opinion: use GPUs, think about a fast CPU mode later (see the sketch below).
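To illustrate why the GPU route helps: the expensive step is projecting the final hidden states onto BLOOM's ~250k-token vocabulary. A rough sketch of doing just that projection on a GPU client follows; the tensor shapes, variable names, and random weights are illustrative, not the actual client code:

```python
import torch

VOCAB_SIZE = 250_880   # BLOOM vocabulary size
HIDDEN_SIZE = 4096     # illustrative hidden size for a ~6B model

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Tied input/output embedding matrix, kept on the client.
# (Random weights here; in practice these come from the checkpoint.)
word_embeddings = torch.randn(VOCAB_SIZE, HIDDEN_SIZE, device=device, dtype=dtype)

def compute_logits(hidden_states: torch.Tensor) -> torch.Tensor:
    """Project final-layer hidden states onto the vocabulary.

    hidden_states: [batch, seq_len, HIDDEN_SIZE], e.g. returned by remote blocks.
    """
    hidden_states = hidden_states.to(device=device, dtype=dtype)
    # This single [*, HIDDEN_SIZE] x [HIDDEN_SIZE, VOCAB_SIZE] matmul is the
    # part that takes over a minute per token on a Colab CPU.
    return hidden_states @ word_embeddings.T

# Example: pick the next token greedily from the last position.
hidden = torch.randn(1, 1, HIDDEN_SIZE)
next_token_id = compute_logits(hidden)[0, -1].argmax().item()
```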

justheuristic commented 2 years ago

Copied the Bloom implementation from HuggingFace, commit https://github.com/huggingface/transformers/commit/ca2a55e9dfb245527b5e1c954fec6ffbb7aef07b

Their attention code is spectacularly bad, see #1

justheuristic commented 2 years ago

Next steps:

justheuristic commented 2 years ago

As of https://github.com/learning-at-home/bloom-demo/commit/82214699f26e098ba295c5c84cf119dd7a6bd1a7