Closed · borzunov closed 10 months ago
These numbers were measured before https://github.com/bigscience-workshop/petals/pull/499/commits/4537c77004dd6a9da8df222cec663f7c62ec2dc2, which slows down inference by 1-2% (but is necessary to make MQA models work properly with the rest of Petals).
H100 (80 GB):
Sep 03 14:07:43.798 [INFO] Inference throughput: 728.4 tokens/sec per block (1 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)
Sep 03 14:07:57.270 [INFO] Forward pass throughput: 93138.6 tokens/sec per block (1024 tokens/batch, NVIDIA H100 PCIe GPU, bfloat16, quantized to nf4)
A100 (80 GB):
Sep 03 13:22:40.739 [INFO] Inference throughput: 710.3 tokens/sec per block (1 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)
Sep 03 13:22:50.803 [INFO] Forward pass throughput: 61680.6 tokens/sec per block (1024 tokens/batch, NVIDIA A100-SXM4-80GB GPU, bfloat16, quantized to nf4)
RTX 6000 Ada (48 GB):
Sep 03 15:14:46.634 [INFO] Inference throughput: 785.9 tokens/sec per block (1 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
Sep 03 15:14:57.330 [INFO] Forward pass throughput: 62151.1 tokens/sec per block (1024 tokens/batch, NVIDIA RTX 6000 Ada Generation GPU, bfloat16, quantized to nf4)
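To put the three GPUs side by side, the per-block numbers above can be compared directly. The snippet below is just arithmetic over the values copied from the logs, not part of this PR:

```python
# Per-block throughput (tokens/sec) copied from the log lines above.
results = {
    "H100 PCIe":      {"inference": 728.4, "forward": 93138.6},
    "A100-SXM4-80GB": {"inference": 710.3, "forward": 61680.6},
    "RTX 6000 Ada":   {"inference": 785.9, "forward": 62151.1},
}

# Use the A100 as the baseline and print relative speedups.
baseline = results["A100-SXM4-80GB"]
for gpu, r in results.items():
    print(
        f"{gpu}: inference {r['inference'] / baseline['inference']:.2f}x, "
        f"forward {r['forward'] / baseline['forward']:.2f}x vs A100"
    )
```

Roughly, the H100 is about 1.5x the A100 on large-batch forward passes, while the RTX 6000 Ada leads slightly on single-token inference.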
This PR adds:

- Support for `transformers.FalconModel` (the in-library format for Falcon). Tested on Falcon-40B.
- A `--throughput dry_run` option to evaluate throughput and exit right away (implemented by @mryab).

Limitations: