bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

[CODE] port HF bloom to hivemind #2

Closed · justheuristic closed this issue 2 years ago

justheuristic commented 2 years ago

Why: so we can play with it in inference mode.

Original bloom code: https://github.com/huggingface/transformers/pull/17474

The quest is to

justheuristic commented 2 years ago

Naively quantized BLOOM-6B on CPU: http://nora:8800/notebooks/decentralized/jheuristic/bloom_test/cpu-qint-matmul.ipynb (source: https://gist.github.com/justheuristic/47830c9ddfd45889894e69d4f45ce233)
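For reference, here is a minimal sketch of what "naive quantization" looks like with PyTorch's built-in dynamic int8 quantization of linear layers; the checkpoint name and the exact scheme are assumptions, not necessarily what the notebook above does:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; substitute the actual 6B checkpoint.
MODEL_NAME = "bigscience/bloom-6b3"

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Naive dynamic quantization: nn.Linear weights are stored as int8,
# activations are quantized on the fly during each matmul on CPU.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    outputs = quantized_model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0]))
```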

justheuristic commented 2 years ago

On client-side computations

Since we plan to run embeddings/logits on the client side, we need to compute them efficiently. Embedding lookup is cheap AF, but logits are more complicated: they require a matmul over the full vocabulary.

Computing the final logits on a Colab CPU takes over a minute per token (see attached screenshot).

Solution 1: use fast KNN

Solution 2: just use a GPU

Current opinion: use GPUs, think about a fast CPU mode later (see the sketch below).
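To illustrate why the GPU route helps: the expensive step is projecting the final hidden states onto BLOOM's ~250k-token vocabulary. A rough sketch of doing just that projection on a GPU client follows; the tensor shapes, variable names, and random weights are illustrative, not the actual client code:

```python
import torch

VOCAB_SIZE = 250_880   # BLOOM vocabulary size
HIDDEN_SIZE = 4096     # illustrative hidden size for a ~6B model

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Tied input/output embedding matrix, kept on the client.
# (Random weights here; in practice these come from the checkpoint.)
word_embeddings = torch.randn(VOCAB_SIZE, HIDDEN_SIZE, device=device, dtype=dtype)

def compute_logits(hidden_states: torch.Tensor) -> torch.Tensor:
    """Project final-layer hidden states onto the vocabulary.

    hidden_states: [batch, seq_len, HIDDEN_SIZE], e.g. returned by remote blocks.
    """
    hidden_states = hidden_states.to(device=device, dtype=dtype)
    # This single [*, HIDDEN_SIZE] x [HIDDEN_SIZE, VOCAB_SIZE] matmul is the
    # part that takes over a minute per token on a Colab CPU.
    return hidden_states @ word_embeddings.T

# Example: pick the next token greedily from the last position.
hidden = torch.randn(1, 1, HIDDEN_SIZE)
next_token_id = compute_logits(hidden)[0, -1].argmax().item()
```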

justheuristic commented 2 years ago

Copied the Bloom implementation from HuggingFace, commit https://github.com/huggingface/transformers/commit/ca2a55e9dfb245527b5e1c954fec6ffbb7aef07b

Their attention code is spectacularly bad, see #1

justheuristic commented 2 years ago

Next steps:

justheuristic commented 2 years ago

As of https://github.com/learning-at-home/bloom-demo/commit/82214699f26e098ba295c5c84cf119dd7a6bd1a7