bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License

Add Falcon support #499

Closed · borzunov closed this 10 months ago

borzunov commented 10 months ago

This PR adds:

Limitations:
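On the client side, the newly supported Falcon checkpoints are used through the usual Petals client API. A minimal sketch (the checkpoint name, prompt, and generation settings below are illustrative assumptions, not something this PR prescribes):

```python
# Minimal client-side sketch: load a Falcon checkpoint through a Petals swarm.
# Assumption: "tiiuae/falcon-40b" is actually being served by the swarm you
# connect to; otherwise from_pretrained() will not find any blocks to use.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run a short generation as a smoke test; layers execute on remote servers.
inputs = tokenizer("A quick test prompt:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```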

borzunov commented 10 months ago

Falcon-40B benchmarks

These numbers were measured before https://github.com/bigscience-workshop/petals/pull/499/commits/4537c77004dd6a9da8df222cec663f7c62ec2dc2, which slows down inference by 1-2% (but is necessary to make MQA models work properly with the rest of Petals).

H100 (80 GB):

Sep 03 14:07:43.798 [INFO] Inference throughput: 728.4 tokens/sec per block (1 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)
Sep 03 14:07:57.270 [INFO] Forward pass throughput: 93138.6 tokens/sec per block (1024 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)

A100 (80 GB):

Sep 03 13:22:40.739 [INFO] Inference throughput: 710.3 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)
Sep 03 13:22:50.803 [INFO] Forward pass throughput: 61680.6 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)

RTX 6000 Ada (48 GB):

Sep 03 15:14:46.634 [INFO] Inference throughput: 785.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 03 15:14:57.330 [INFO] Forward pass throughput: 62151.1 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
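To put the per-block figures in perspective, here is a back-of-the-envelope conversion to end-to-end single-batch speed, using the H100 inference number above (assumptions: Falcon-40B's 60 transformer blocks all run at the measured per-block rate, and network/client overhead is ignored):

```python
# Rough reading of the "per block" inference throughput reported above.
# Assumption: Falcon-40B has 60 transformer blocks, every block runs at the
# measured per-block rate, and network and client overhead are ignored.
num_blocks = 60
per_block_tokens_per_sec = 728.4  # H100, 1 token/batch, bfloat16, nf4

# Each generated token must pass through all blocks sequentially:
end_to_end_tokens_per_sec = per_block_tokens_per_sec / num_blocks
print(f"~{end_to_end_tokens_per_sec:.1f} tokens/sec end-to-end")  # ≈ 12.1
```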