ggerganov / llama.cpp

LLM inference in C/C++
MIT License
67.93k stars 9.74k forks

Performance investigation using AMD BLIS instead of OpenBLAS on 16 core AMD Zen1 #637

Closed gjmulder closed 1 year ago

gjmulder commented 1 year ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

Compiling against the AMD-optimized BLIS implementation of BLAS allows me to run perplexity tests.

Current Behavior

Compiling against the AMD-optimized BLIS implementation of BLAS causes the perplexity command to process 0 chunks.

llama.cpp$ git log | head -1
commit 3df890aef432ce68143cfafcd7caf828bc4c3e55
llama.cpp$ python3 --version
Python 3.10.9
llama.cpp$ make --version | head -1
GNU Make 4.3
llama.cpp$ g++ --version | head -1
g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

Steps to Reproduce

  1. Install the latest BLIS libs from GitHub:
blis$ sudo make install
Installing libblis.a into /usr/local/lib/
Installing libblis.so.4.0.0 into /usr/local/lib/
Generating monolithic cblas.h.........
Generated include/zen/cblas.h
Installing blis.h cblas.h blis.hh cblas.hh into /usr/local/include/blis/
Installing config.mk common.mk into /usr/local/share/blis/
Installing config/zen/make_defs.mk into /usr/local/share/blis/config/zen
mkdir -p /usr/local/share/pkgconfig
Installing blis.pc into /usr/local/share/pkgconfig/
install -c -m 0644 blis.pc /usr/local/share/pkgconfig
  2. Update the Makefile to use BLIS instead of OpenBLAS:
llama.cpp$ diff Makefile.bliss Makefile.dist 
183,184c183,184
<   CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/blis
<   LDFLAGS += -lblis
---
>   CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
>   LDFLAGS += -lopenblas
  3. Compile against BLIS; perplexity processes 0 chunks.

174-second run just calling ./main linked against OpenBLAS:

llama.cpp$ make -f Makefile.dist clean && LLAMA_OPENBLAS=1 make -f Makefile.dist;ldd ./main;time ./main -t 16 -m ./models/7B/ggml-model-q4_0.bin -b 256 -n 512 -p "blis or blas"
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

rm -vf *.o main quantize perplexity embedding
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'perplexity'
removed 'embedding'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/openblas   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lopenblas
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lopenblas
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lopenblas
    linux-vdso.so.1 (0x00007ffd8c7a7000)
    libopenblas.so.0 => /lib/x86_64-linux-gnu/libopenblas.so.0 (0x00007f3bb8880000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f3bb8656000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f3bb856f000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3bb854f000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3bb8327000)
    libgfortran.so.5 => /lib/x86_64-linux-gnu/libgfortran.so.5 (0x00007f3bb804a000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f3bbadd8000)
    libquadmath.so.0 => /lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f3bb8002000)
main: seed = 1680212228
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 256, n_predict = 512, n_keep = 0

 blis or blas
"Blas," said the voice in the dark. "There is a Blas. He's been here all day long. He hasn't moved. I think he might be asleep."
I could hear breathing, but it was too distant to place where it was coming from. Blas. The name came to me from some forgotten dream. I couldn't recall why I had remembered it or where the thought had come from, but the name was there in my head: Blas. And then it was gone again, like a bird in flight.
"Is he still here?" said the voice. "I can't see him."
"Here," I said. "Yes." The word hung on the air of our cave, suspended between us. But that word was also gone.
We had been together for three days now. Three days ago we had met in the woods; on day two we had found a cave deep in the forest and had made it into our own little world. Now it was nighttime again and Blas slept. I could hear his breathing, but there were other sounds too: waves, a distant breeze, the creaking of tree limbs heavy with snow.
"I think he might be asleep," said the voice in the dark.
It was a strange thing to hear that voice again: we had come to know it so well since we'd met—it had been there in my head for weeks and weeks, but now suddenly it seemed like an old friend, someone I knew very well from childhood days: my mother's voice, or the sound of the sea. I couldn't quite work out what it was. And then again, that name was gone, swirling round in me like a leaf flung against a stone in a river. Blas. It must have been some kind of bird, perhaps a small bird with a short tail.
"I think he might be asleep," said the voice. "What shall we do?"
"Shall we go to bed?" I asked. I could hear my own words coming out of the dark cave like birdsong: they had been there in me for days and now suddenly they were back again, like a message from the past. And then it was gone too.
The voice sighed with relief as if at some unsaid thing that was now gone—and it came to me again: "Shall we
llama_print_timings:        load time =  1072.38 ms
llama_print_timings:      sample time =   401.94 ms /   512 runs   (    0.79 ms per run)
llama_print_timings: prompt eval time = 15402.35 ms /   263 tokens (   58.56 ms per token)
llama_print_timings:        eval time = 157868.10 ms /   510 runs   (  309.55 ms per run)
llama_print_timings:       total time = 174278.01 ms

real    2m54.504s
user    46m6.640s
sys 3m35.773s

47-second run calling ./main linked against the AMD BLIS libs:

llama.cpp$ make -f Makefile.bliss clean && LLAMA_OPENBLAS=1 make -f Makefile.bliss;ldd ./main;time ./main -t 16 -m ./models/7B/ggml-model-q4_0.bin -b 256 -n 512 -p "blis or blas"
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

rm -vf *.o main quantize perplexity embedding
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'perplexity'
removed 'embedding'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lblis
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lblis

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lblis
    linux-vdso.so.1 (0x00007fff553ed000)
    libblis.so.4 => /usr/local/lib/libblis.so.4 (0x00007f1011a8c000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1011862000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f101177b000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f101175b000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1011533000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f101200e000)
main: seed = 1680212135
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 256, n_predict = 512, n_keep = 0

 blis or blas
I know the word is spelled either with a B, L or an S. I believe it was used for either a sword (blade) or a small dagger or short knife.
Does anyone know the correct spelling?
Thanks in advance to everyone who might answer this question.
blis, blas
It is indeed possible that it's spelt both ways: https://en.wikipedia.org/wiki/Bliss_(disambiguation)
So you are correct it could be either way but it would depend on the context. It is used in Scottish names like Bliss-Carver or Blaisdell for example.
I agree with @MikeSteeden, it's possible to find it spelled both ways. I am not sure what exactly is your question: do you want to know if either one of these variants is correct? If so, then the answer is yes: bliss and blaise are acceptable spellings.
I wanted to know which is correct. It's a family name and I just got confused as to which spelling is correct since I have seen it in two different ways. Thanks for your answers. [end of text]

llama_print_timings:        load time =  1076.06 ms
llama_print_timings:      sample time =   190.66 ms /   243 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =   482.64 ms /     6 tokens (   80.44 ms per token)
llama_print_timings:        eval time = 46036.34 ms /   242 runs   (  190.23 ms per run)
llama_print_timings:       total time = 47307.07 ms

real    0m47.525s
user    12m20.192s
sys 0m1.248s

Perplexity run with BLIS doesn't process any chunks:

llama.cpp$ make -f Makefile.bliss clean && LLAMA_OPENBLAS=1 make -f Makefile.bliss;ldd ./perplexity ;time ./perplexity -t 16 -m ./models/7B/ggml-model-q4_0.bin -f /data/llama/wikitext-2-raw/wiki.wiki.test.raw
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

rm -vf *.o main quantize perplexity embedding
removed 'common.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'perplexity'
removed 'embedding'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lblis
I CC:       cc (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0
I CXX:      g++ (Ubuntu 10.4.0-4ubuntu1~22.04) 10.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread -c examples/common.cpp -o common.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lblis

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lblis
    linux-vdso.so.1 (0x00007ffced7f6000)
    libblis.so.4 => /usr/local/lib/libblis.so.4 (0x00007fbe1ee26000)
    libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fbe1ebfc000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbe1eb15000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fbe1eaf5000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbe1e8cd000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fbe1f3a6000)
main: seed = 1680214250
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 0 chunks

llama_print_timings:        load time =     9.90 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =   578.24 ms

real    0m0.700s
user    0m0.105s
sys 0m0.579s
gjmulder commented 1 year ago

Just checked OpenBLAS. Same behaviour.

slaren commented 1 year ago

-f /data/llama/wikitext-2-raw/wiki.wiki.test.raw

Is that the right file name? Probably the real issue here is that when -f is used with a non-existent file it doesn't show any error.
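
Something like this guard around the -f argument would surface the problem (a minimal sketch only, not the actual perplexity.cpp code; read_prompt_file is a hypothetical helper):

#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical guard: fail loudly when the -f file cannot be opened,
// instead of silently tokenizing an empty prompt (hence 0 chunks).
static bool read_prompt_file(const std::string & fname, std::string & out) {
    std::ifstream f(fname);
    if (!f) {
        fprintf(stderr, "error: failed to open '%s'\n", fname.c_str());
        return false;
    }
    out.assign(std::istreambuf_iterator<char>(f), std::istreambuf_iterator<char>());
    return true;
}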

slaren commented 1 year ago

On a side note, keep in mind that using BLAS to evaluate perplexity may give misleading values: BLAS appears to do matrix multiplication at higher precision, but it is only used for the prompt, not during generation.

gjmulder commented 1 year ago

-f /data/llama/wikitext-2-raw/wiki.wiki.test.raw

Is that the right file name? Probably the real issue here is that when -f is used with a non-existent file it doesn't show any error.

Good catch. Running now. TVM.

FNsi commented 1 year ago

I only installed BLIS and did the same as you did.

My system_info in main does not show BLAS = 1, but there is a noticeable speed boost.

make LLAMA_OPENBLAS=1
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lblis
I CC:       cc (Ubuntu 12.2.0-3ubuntu1) 12.2.0
I CXX:      g++ (Ubuntu 12.2.0-3ubuntu1) 12.2.0

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lblis

==== Run ./main -h for help. ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lblis g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lblis g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lblis

main: warning: model does not support context sizes greater than 2048 tokens (5377 specified); expect poor results
main: seed = 1680252440
llama_model_load: loading model from 'models/30B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 5377
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 3
llama_model_load: n_ff    = 17920
llama_model_load: n_parts = 4
llama_model_load: type    = 3
llama_model_load: ggml map size = 23269.46 MB
llama_model_load: ggml ctx size =   151.25 KB
llama_model_load: mem required  = 25573.60 MB (+ 3124.00 MB per state)
llama_model_load: loading tensors from 'models/30B/ggml-model-q4_1.bin'
llama_model_load: model size = 23269.01 MB / num tensors = 543
llama_init_from_file: kv self size  = 8191.52 MB

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 8, n_predict = -1, n_keep = 0

gjmulder commented 1 year ago

I only installed BLIS and did the same as you did.

My system_info in main does not show BLAS = 1, but there is a noticeable speed boost.

I think you need to increase the batch size to cause it to use BLAS.

Note that you also have to use the following when building BLIS to enable CBLAS support:

./configure auto --enable-cblas

If you get it to work, keep an eye on your total CPU% using top: there's some weird behaviour where BLIS causes llama.cpp to sit at 100% on just one core, independent of the -t setting. I'm currently trying to isolate it.
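
For context on the batch size: ggml only dispatches matrix multiplications to BLAS when the operands are big enough, which is why single-token generation never hits it. Roughly, the gate (a simplified sketch of ggml_compute_forward_mul_mat_use_blas in ggml.c of this era; treat the details as illustrative) looks like:

// Simplified sketch of ggml's BLAS gate: both operands must be contiguous
// and every dimension at least 32. During generation the batch is a single
// token (ne1 == 1), so BLAS is never used; a prompt evaluated with -b >= 32 is.
static bool mul_mat_use_blas(int ne0, int ne1, int ne10,
                             bool src0_contig, bool src1_contig) {
    return src0_contig && src1_contig &&
           ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}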

FNsi commented 1 year ago

I only installed BLIS and did the same as you did.

My system_info in main does not show BLAS = 1, but there is a noticeable speed boost.

I think you need to increase the batch size to cause it to use BLAS.

Note that you also have to use the following when building BLIS to enable CBLAS support:

./configure auto --enable-cblas

If you get it to work, keep an eye on your total CPU% using top: there's some weird behaviour where BLIS causes llama.cpp to sit at 100% on just one core, independent of the -t setting. I'm currently trying to isolate it.

I checked my BLIS config.mk and it shows MK_ENABLE_BLAS = yes.

But the CBLAS one you mentioned is no.

Do you think I need to change that value from no to yes?

gjmulder commented 1 year ago

One major inconsistency is that the text generated is of different lengths with the different BLAS libs, so the total time for BLIS was less because of an early [end of text] token. So no magic 2x performance gain.

"When it is too good to be true, it is probably not true!"

I'll see if I can get an apples-to-apples perplexity run working.

gjmulder commented 1 year ago

I checked my BLIS config.mk and it shows MK_ENABLE_BLAS = yes.

But the CBLAS one you mentioned is no.

Do you think I need to change that value from no to yes?

For a clean BLIS build:

blis$ ./configure auto --enable-cblas
blis$ make clean; make -j
blis$ sudo make install

If ./main is throwing symbol errors, it is because CBLAS support is not compiled into libblis.so.4.
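
You can check whether the installed library actually exports the CBLAS entry point ggml calls (cblas_sgemm) with standard binutils:

blis$ nm -D /usr/local/lib/libblis.so | grep cblas_sgemm

If that prints nothing, the build was configured without --enable-cblas.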

FNsi commented 1 year ago

I checked my BLIS config.mk and it shows MK_ENABLE_BLAS = yes.

But the CBLAS one you mentioned is no.

Do you think I need to change that value from no to yes?

For a clean BLIS build:

blis$ ./configure auto --enable-cblas
blis$ make clean; make -j
blis$ sudo make install

If ./main is throwing symbol errors, it is because CBLAS support is not compiled into libblis.so.4.

I rebuilt BLIS with

./configure --enable-cblas zen3
make
make install

Then rebuilt llama.cpp with make LLAMA_OPENBLAS=1.

And nothing changed...

Besides, I changed -b to 256.

Still BLAS = 0; seems I need to install OpenBLAS? 😅😂

gjmulder commented 1 year ago

Did you change the Makefile to link against BLIS instead of OpenBLAS?

I wouldn't worry about it. There's clearly some weird threading interaction between BLIS and llama.cpp and the performance gains disappear when the output is of the same length:

OpenBLAS:

llama_print_timings:        load time =  1102.72 ms
llama_print_timings:      sample time =   199.87 ms /   256 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =   539.28 ms /     7 tokens (   77.04 ms per token)
llama_print_timings:        eval time = 53608.58 ms /   255 runs   (  210.23 ms per run)
llama_print_timings:       total time = 54915.92 ms

BLIS:

llama_print_timings:        load time =  1106.59 ms
llama_print_timings:      sample time =   201.36 ms /   256 runs   (    0.79 ms per run)
llama_print_timings: prompt eval time =   560.20 ms /     7 tokens (   80.03 ms per token)
llama_print_timings:        eval time = 53431.06 ms /   255 runs   (  209.53 ms per run)
llama_print_timings:       total time = 54743.28 ms
omarkazmi commented 1 year ago

Did you compile BLIS with multithreading enabled? It defaults to off. Haven't tested to see if that's the threading interaction yet, though. https://github.com/flame/blis/blob/master/docs/Multithreading.md#enabling-multithreading
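
For reference, BLIS fixes its threading model at configure time (per the linked doc), so a multithread-capable build would look something like this, with openmp being one of the supported models:

blis$ ./configure --enable-cblas --enable-threading=openmp auto
blis$ make -j
blis$ sudo make install

At runtime, BLIS_NUM_THREADS then sets the thread count.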

FNsi commented 1 year ago

Did you change the Makefile to link against BLIS instead of OpenBLAS?

I wouldn't worry about it. There's clearly some weird threading interaction between BLIS and llama.cpp and the performance gains are minimal at best per token:

OpenBLAS:


llama_print_timings:        load time =  1102.72 ms

llama_print_timings:      sample time =   199.87 ms /   256 runs   (    0.78 ms per run)

llama_print_timings: prompt eval time =   539.28 ms /     7 tokens (   77.04 ms per token)

llama_print_timings:        eval time = 53608.58 ms /   255 runs   (  210.23 ms per run)

llama_print_timings:       total time = 54915.92 ms

BLIS:


llama_print_timings:        load time =  1097.35 ms

llama_print_timings:      sample time =   185.29 ms /   236 runs   (    0.79 ms per run)

llama_print_timings: prompt eval time =   543.52 ms /     7 tokens (   77.65 ms per token)

llama_print_timings:        eval time = 46904.29 ms /   235 runs   (  199.59 ms per run)

llama_print_timings:       total time = 48190.76 ms

I think I did post it already:

-DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS:  -lblis

So BLIS should work, and I think it already works, since I haven't installed OpenBLAS but the speed increased.

It still somehow doesn't make sense.

gjmulder commented 1 year ago

Did you compile blis with multithreading enabled? It defaults to off. Haven't tested to see if that's the threading interaction yet, though. https://github.com/flame/blis/blob/master/docs/Multithreading.md#enabling-multithreading

Good idea. Tried it, but it didn't seem to change anything.

FNsi commented 1 year ago

With BLIS, even though BLAS = 0 is shown:

llama_print_timings:        load time =  4804.64 ms
llama_print_timings:      sample time =    63.30 ms /   128 runs   (    0.49 ms per run)
llama_print_timings: prompt eval time =  2770.03 ms /     6 tokens (  461.67 ms per token)
llama_print_timings:        eval time = 77765.66 ms /   127 runs   (  612.33 ms per run)
llama_print_timings:       total time = 82635.44 ms

real    1m23.066s
user    21m19.269s
sys     0m3.417s

Without BLIS:

llama_print_timings:        load time =  4730.88 ms
llama_print_timings:      sample time =    61.95 ms /   128 runs   (    0.48 ms per run)
llama_print_timings: prompt eval time =  2677.84 ms /     6 tokens (  446.31 ms per token)
llama_print_timings:        eval time = 78713.50 ms /   127 runs   (  619.79 ms per run)
llama_print_timings:       total time = 83508.30 ms

real    1m23.950s
user    21m29.628s
sys     0m3.575s

Conclusion: weird!

Did more tests with -b 128, but the LLAMA_OPENBLAS=1 builds still perform slower... even though I thought the speed had increased.

Maybe the problem is my system structure, since I use apx to manage my system apps.

omarkazmi commented 1 year ago

@FNsi I just bypassed the whole LLAMA_OPENBLAS flag by forcing the flags into the defaults in the Makefile. Mine looks like:

CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC -DGGML_USE_OPENBLAS -I/usr/local/include/blis
CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
LDFLAGS  = -lblis

around line 35 or so. BLAS=1 is shown when I run inference.
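
A tidier variant would be a dedicated switch next to the existing LLAMA_OPENBLAS block (hypothetical LLAMA_BLIS flag, not in the upstream Makefile):

# Hypothetical LLAMA_BLIS switch, mirroring the LLAMA_OPENBLAS block
ifdef LLAMA_BLIS
    CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/blis
    LDFLAGS += -lblis
endif

and then build with make LLAMA_BLIS=1.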

gjmulder commented 1 year ago

I'll re-open it if people are interested in playing around with BLIS.

Similar to OpenBLAS, export BLIS_NUM_THREADS=2 seems to be ignored by llama.cpp

FNsi commented 1 year ago

@FNsi I just bypassed the whole LLAMA_OPENBLAS flag by forcing the flags into default in the makefile. Mine looks like


CFLAGS   = -I.              -O3 -DNDEBUG -std=c11   -fPIC -DGGML_USE_OPENBLAS -I/usr/local/include/blis

CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC

LDFLAGS  = -lblis

around line 35 or so. BLAS=1 is shown when I run inference.

I think I realized the problem I made.

Just figured out that with ABRoot I need to change

-DGGML_USE_OPENBLAS -I/usr/local/include/blis

to

-DGGML_USE_OPENBLAS -I/.system/usr/local/include/blis

So the llama.cpp I built just bypassed BLIS, even with LLAMA_OPENBLAS=1.

Thank you guys 😂

omarkazmi commented 1 year ago

For what it's worth, there seem to be two BLIS repos: the AMD-maintained fork at https://github.com/amd/blis and the original at https://github.com/flame/blis, which is updated far more frequently. I'm not sure whether the original repo's maintainers are incorporating AMD's changes, but it might be worth comparing the two if someone's doing performance testing anyway.

gjmulder commented 1 year ago

For what it's worth, there seem to be two BLIS repos: the AMD-maintained fork at https://github.com/amd/blis and the original at https://github.com/flame/blis, which is updated far more frequently. I'm not sure whether the original repo's maintainers are incorporating AMD's changes, but it might be worth comparing the two if someone's doing performance testing anyway.

blis$ git log | head -4
commit e3fc540b972a25f618af2e055641ad00ca51113e
Merge: 77c8f069 ea4acd26
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date:   Sat Nov 12 13:37:42 2022 +0530

gjmulder commented 1 year ago

blis$ git log | head -3
commit 38fc5237520a2f20914a9de8bb14d5999009b3fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 30 17:30:07 2023 -0500

llama_print_timings:        load time =  1083.79 ms
llama_print_timings:      sample time =   200.86 ms /   256 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =   533.84 ms /     7 tokens (   76.26 ms per token)
llama_print_timings:        eval time = 53060.28 ms /   255 runs   (  208.08 ms per run)
llama_print_timings:       total time = 54349.16 ms

:man_shrugging:

omarkazmi commented 1 year ago

@gjmulder Same threading issues too?

gjmulder commented 1 year ago

@gjmulder Same threading issues too?

@omarkazmi it is nearly twice as fast when doing perplexity! Woohoo! Before it was sitting at 100% CPU, now 187% :partying_face: :partying_face: :partying_face:

EDIT: That was sarcasm. 2200+% CPU with OpenBLAS.

FNsi commented 1 year ago

Funny things happened again: with BLAS, the prompt eval speed dropped.

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 256, n_predict = 512, n_keep = 0

blis or blas (blis)

  1. A blow; a buffet; a box on the ear; hence, a check; restraint; hindrance; obstruction; clog; embarrassment. "Thou shalt not oppress an hireling ... because he is poor and needy: that thy brother may live with thee in the land." (Deut. xxiv. 14). "And blessed be the Lord for evermore." (Neh. ix. 5)
  2. An injury; harm done to a thing, or an event which does harm. "I am not a man of violence that I should offer to touch you with my hand and to strike you." "The evil spirit from God was upon him." (Judges xiv. 19). "I have done a blot, and thou art the cause of it."
  3. Intransitive verb To do an injury; to harm one's self or others. "What! is this all that I have cost thee?" "You will not believe me. You are hardened in your sinfulness: you will never repent. But, oh God, what a bliss you have thrown away!"
  4. To make blisters. "He was well rubbed with the blister."
  5. To be disposed of; to go; to come along. "Bless me! The very name of that place has such an effect upon my nerves, that I never can sleep a wink at night after thinking of it, more especially if it be mentioned to me between the hours of ten and four in the morning."
  6. Intransitive verb To become red; to flush. "You may not be able to tell whether you are blushing, or whether it is only the sun shining full upon your face."
  7. An outrageous offense; an unpardonable crime. "I have committed a bliss. I am condemned, and justly too: but it grieves me to the heart that my brother should be the witness of my shame and ignominy."
  8. The act or practice of blessing; benediction; praise; commendation. "What is there in heaven, what is there in earth, that will make amends for your bliss?" "You have done a blot; but you are not to be pitied on that account." 9

llama_print_timings:        load time =   4863.57 ms
llama_print_timings:      sample time =    292.69 ms /   512 runs   (    0.57 ms per run)
llama_print_timings: prompt eval time =   2797.95 ms /     6 tokens (  466.33 ms per token)
llama_print_timings:        eval time = 545305.15 ms /   511 runs   ( 1067.13 ms per run)
llama_print_timings:       total time = 550469.28 ms

real    9m10.882s
user    142m39.586s
sys     0m5.066s

system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 256, n_predict = 512, n_keep = 0

blis or blas (blys),USA pronunciation adj., adv. not clear, sharp, distinct, or intelligible: a blurred photograph. to become blurred or unclear, as in outline, form, shape, character, etc.: His features blurred as he stepped into the fog. Hello guys, this picture is about Blis Networks (good Blix Networks Amazing Ideas #1). This post is a image/jpeg and the resolution of this attachment is 650 x 372. It's file size is only 49 KB. Wether You decided to save it to Your laptop, you have to Click here. You might also see more attachments by clicking the following image or read more at here: Blix Networks. Everybody knows that coloring is one of many most important aspects in making a layout that is beautiful room. Colour can be an essential part for decorating remodeling or generating designs, thus choosing the colors that are right have to be carefully considered. As stated in the previous post, along with may push effect on connection and emotion. Consequently, you ought to pay specific awareness in deciding on the best coloring for your household bedrooms. The sack is just a refuge where we sleep once we are tired, an area where we sleep, tired of the everyday routine, or maybe when we are sick. A place should be quiet and tranquil the most important bedroom in which we can close our doorways. Bedrooms must be vibrant as well as airy colours. Because of the importance of the big event of the room, you want to share the very best bedroom designs. We ought to select coloring and the design that may produce us realize satisfaction and luxury. Harmony wills drive in a chaotic day. By having an area with superior Blis Networks (good Blix Networks Amazing Ideas #1) colour can be a luxury by itself, you'll observe. [end of text]

llama_print_timings:        load time =   4913.29 ms
llama_print_timings:      sample time =    217.60 ms /   431 runs   (    0.50 ms per run)
llama_print_timings: prompt eval time =   2869.76 ms /     6 tokens (  478.29 ms per token)
llama_print_timings:        eval time = 328921.86 ms /   430 runs   (  764.93 ms per run)
llama_print_timings:       total time = 334058.83 ms

real    5m34.481s
user    87m25.000s
sys     0m5.262s

gjmulder commented 1 year ago

@FNsi it is only six tokens. The difference in performance is likely due to the shortness of the sample.

llama_print_timings: eval time per run looks to improve by about 25% w/BLAS.

Note that longer runs look to take progressively longer for each additional token generated, so some of the 25% gain might be due to the fact that the BLAS run generated 81 fewer tokens.

FNsi commented 1 year ago

@FNsi it is only six tokens. The difference in performance is likely due to the shortness of the sample.

llama_print_timings: eval time per run looks to improve by about 25% w/BLAS.

Note that longer runs look to take progressively longer for each additional token generated, so some of the 25% gain might be due to the fact that the BLAS run generated 81 fewer tokens.

I agree. And I saw your comment about the 2000%+ CPU figure? How did you manage that? I also tried building the multithreaded BLIS, but nothing seems different.

gjmulder commented 1 year ago

I have 16 AMD cores (i.e. 32 hyperthreads). With BLAS -t 16 I get a load average of around 22. With BLIS and long prompts or perplexity runs the load average was less than 2.0.

FNsi commented 1 year ago

I have 16 AMD cores (i.e. 32 hyperthreads). With BLAS -t 16 I get a load average of around 22. With BLIS and long prompts or perplexity runs the load average was less than 2.0.

I assume the first 'with' you said is without?

That's a huge improvement!

gjmulder commented 1 year ago

BLAS seems to want to multithread independent of what I set OPENBLAS_NUM_THREADS to. BLAS therefore looks to be spawning 2 threads per llama.cpp thread, but those 32 total threads aren't running at 100% CPU, or I'd expect to see a load average close to 32.
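
If anyone wants to experiment, both libraries also expose runtime thread controls in C (openblas_set_num_threads and bli_thread_set_num_threads are real entry points in OpenBLAS and BLIS respectively), though wiring them into llama.cpp like this is purely a sketch:

#include <cblas.h>        // OpenBLAS; declares openblas_set_num_threads
// #include <blis/blis.h> // BLIS; declares bli_thread_set_num_threads

// Hypothetical helper: pin the BLAS thread pool to the -t setting
// instead of relying on OPENBLAS_NUM_THREADS / BLIS_NUM_THREADS.
static void set_blas_threads(int n_threads) {
    openblas_set_num_threads(n_threads);
    // bli_thread_set_num_threads(n_threads); // BLIS equivalent
}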

FNsi commented 1 year ago
  • Without BLAS -t 16 has a load average of 16

  • With BLAS -t 16 has a load average of about 22

BLAS seems to want to multithread independent of what I set OPENBLAS_NUM_THREADS to. BLAS therefore looks to be spawning 2 threads per llama.cpp thread, but those 32 total threads aren't running at 100% CPU, or I'd expect to see a load average close to 32.

And if BLAS could actually run across all 16 threads......