ggerganov / llama.cpp

LLM inference in C/C++

Adding NVPL BLAS support #8329

Closed: nicholaiTukanov closed this issue 2 months ago

nicholaiTukanov commented 3 months ago


Feature Description

Hi all! Amazing work on llama.cpp!

I am an engineer from NVIDIA working on NVPL BLAS (a BLAS library designed for the NVIDIA Grace CPU).

I would like to add NVPL BLAS as a build option in the Makefile and in ggml-blas.cpp. In my testing it outperforms the default GGML kernels when using fewer than 32 threads, measured with the pp512 prompt-processing test from llama-bench at build b3322; see the table below.

My changes can be found here: https://github.com/nicholaiTukanov/llama.cpp/tree/ntukanov/add-nvpl. Please let me know if there is anything else I need to do to get this approved. Thank you!

| CPU | Model | Model Size [GiB] | Threads | Test | t/s (master) | t/s (nt/nvpl-blas) | Speedup |
|---|---|---|---|---|---|---|---|
| Grace C2 | llama 7B Q8_0 | 6.67 | 1 | pp512 | 5.93 | 7.03 | 1.19 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 2 | pp512 | 12.12 | 13.97 | 1.15 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 4 | pp512 | 24.55 | 27.81 | 1.13 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 8 | pp512 | 50.19 | 55.49 | 1.11 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 16 | pp512 | 100.34 | 107.07 | 1.07 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 32 | pp512 | 197.88 | 205.09 | 1.04 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 64 | pp512 | 371.18 | 355.62 | 0.96 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 72 | pp512 | 398.27 | 364.31 | 0.91 |
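For context, below is a rough sketch of the integration pattern being proposed: ggml's BLAS path calls the standard CBLAS interface, and a compile-time macro selects which vendor header (and thread-control call) is used. The macro `GGML_BLAS_USE_NVPL`, the header `nvpl_blas.h`, and `nvpl_blas_set_num_threads()` are assumed names for illustration only; the exact names are whatever the linked branch defines.

```cpp
// Illustrative sketch only: a vendor-selectable CBLAS sgemm call.
// GGML_BLAS_USE_NVPL, <nvpl_blas.h>, and nvpl_blas_set_num_threads() are
// assumed names for this example; see the linked branch for the real ones.
#include <cstdio>
#include <vector>

#if defined(GGML_BLAS_USE_NVPL)
#    include <nvpl_blas.h>   // assumed: NVPL BLAS exposing the standard CBLAS API
#else
#    include <cblas.h>       // any CBLAS provider (OpenBLAS, BLIS, ...)
#endif

// C = A * B for row-major single-precision matrices (A is MxK, B is KxN).
static void sgemm_rowmajor(int M, int N, int K,
                           const float *A, const float *B, float *C) {
#if defined(GGML_BLAS_USE_NVPL)
    // Assumed NVPL-specific thread-control call, analogous to
    // mkl_set_num_threads() in the MKL path.
    nvpl_blas_set_num_threads(8);
#endif
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,
                      B, N,
                0.0f, C, N);
}

int main() {
    const int M = 2, N = 2, K = 3;
    std::vector<float> A = {1, 2, 3,
                            4, 5, 6};
    std::vector<float> B = {1, 0,
                            0, 1,
                            1, 1};
    std::vector<float> C(M * N, 0.0f);

    sgemm_rowmajor(M, N, K, A.data(), B.data(), C.data());

    printf("%.1f %.1f\n%.1f %.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

With the default path this compiles against any CBLAS provider (for example `g++ sgemm_demo.cpp -lopenblas`); the NVPL path would instead define the macro and link against the NVPL BLAS libraries on a Grace system.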

Motivation

This will improve prompt-processing performance for aarch64 (NVIDIA Grace) users; see the benchmark table above.

Possible Implementation

See the branch linked above.

ggerganov commented 3 months ago

Seems OK to add - feel free to open a PR.

nicholaiTukanov commented 2 months ago

Closing since #8425 has been merged. Thank you all for your help.