ggerganov / llama.cpp

LLM inference in C/C++

Adding NVPL BLAS support #8329

Closed: nicholaiTukanov closed this issue 2 months ago

nicholaiTukanov commented 3 months ago


Feature Description

Hi all! Amazing work on llama.cpp!

I am an engineer from NVIDIA working on NVPL BLAS (a BLAS library designed for the NVIDIA Grace CPU).

I would like to add NVPL BLAS as a build option in the Makefile and in ggml-blas.cpp. In my testing it outperforms the default GGML kernels when using fewer than 32 threads, measured with the pp512 prompt-processing test from llama-bench at build b3322; see the table below.

My changes can be found here: https://github.com/nicholaiTukanov/llama.cpp/tree/ntukanov/add-nvpl. Please let me know if there is anything else I need to do to get this approved. Thank you!

| CPU | Model | Model Size [GiB] | Threads | Test | t/s (master) | t/s (nt/nvpl-blas) | Speedup |
|---|---|---|---|---|---|---|---|
| Grace C2 | llama 7B Q8_0 | 6.67 | 1 | pp512 | 5.93 | 7.03 | 1.19 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 2 | pp512 | 12.12 | 13.97 | 1.15 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 4 | pp512 | 24.55 | 27.81 | 1.13 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 8 | pp512 | 50.19 | 55.49 | 1.11 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 16 | pp512 | 100.34 | 107.07 | 1.07 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 32 | pp512 | 197.88 | 205.09 | 1.04 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 64 | pp512 | 371.18 | 355.62 | 0.96 |
| Grace C2 | llama 7B Q8_0 | 6.67 | 72 | pp512 | 398.27 | 364.31 | 0.91 |
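For context, below is a rough sketch of the integration pattern being proposed: ggml's BLAS path calls the standard CBLAS interface, and a compile-time macro selects which vendor header (and thread-control call) is used. The macro `GGML_BLAS_USE_NVPL`, the header `nvpl_blas.h`, and `nvpl_blas_set_num_threads()` are assumed names for illustration only; the exact names are whatever the linked branch defines.

```cpp
// Illustrative sketch only: a vendor-selectable CBLAS sgemm call.
// GGML_BLAS_USE_NVPL, <nvpl_blas.h>, and nvpl_blas_set_num_threads() are
// assumed names for this example; see the linked branch for the real ones.
#include <cstdio>
#include <vector>

#if defined(GGML_BLAS_USE_NVPL)
#    include <nvpl_blas.h>   // assumed: NVPL BLAS exposing the standard CBLAS API
#else
#    include <cblas.h>       // any CBLAS provider (OpenBLAS, BLIS, ...)
#endif

// C = A * B for row-major single-precision matrices (A is MxK, B is KxN).
static void sgemm_rowmajor(int M, int N, int K,
                           const float *A, const float *B, float *C) {
#if defined(GGML_BLAS_USE_NVPL)
    // Assumed NVPL-specific thread-control call, analogous to
    // mkl_set_num_threads() in the MKL path.
    nvpl_blas_set_num_threads(8);
#endif
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,
                      B, N,
                0.0f, C, N);
}

int main() {
    const int M = 2, N = 2, K = 3;
    std::vector<float> A = {1, 2, 3,
                            4, 5, 6};
    std::vector<float> B = {1, 0,
                            0, 1,
                            1, 1};
    std::vector<float> C(M * N, 0.0f);

    sgemm_rowmajor(M, N, K, A.data(), B.data(), C.data());

    printf("%.1f %.1f\n%.1f %.1f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```

With the default path this compiles against any CBLAS provider (for example `g++ sgemm_demo.cpp -lopenblas`); the NVPL path would instead define the macro and link against the NVPL BLAS libraries on a Grace system.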

Motivation

This will improve prompt-processing performance for aarch64 (NVIDIA Grace) users; see the benchmark table above.

Possible Implementation

See the branch linked above.

ggerganov commented 3 months ago

Seems OK to add - feel free to open a PR.

nicholaiTukanov commented 2 months ago

Closing since #8425 has been merged. Thank you all for your help.