Closed gjmulder closed 1 year ago
Just checked OpenBLAS. Same behaviour.
-f /data/llama/wikitext-2-raw/wiki.wiki.test.raw
Is that the right file name? Probably the real issue here is that when -f is used with a non-existing file it doesn't show any error.
On a side note keep in mind that using BLAS to evaluate the perplexity may give misleading values, since BLAS appears to do matrix multiplication with higher precision, but it is not available when generating, only for the prompt.
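Until main reports that itself, a guard in the calling shell can catch a silently missing -f file. A minimal sketch (the paths are only placeholders, and which wiki file is the right one is an assumption):
# placeholder paths; adjust to your model and dataset
PROMPT_FILE=/data/llama/wikitext-2-raw/wiki.test.raw
[ -r "$PROMPT_FILE" ] || { echo "prompt file not found: $PROMPT_FILE" >&2; exit 1; }
./perplexity -m models/30B/ggml-model-q4_1.bin -f "$PROMPT_FILE"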
Good catch. Running now. TVM.
I only installed BLIS and did the same as you did.
My system_info in main.cpp does not show BLAS = 1, but I still got a speed boost.
make LLAMA_OPENBLAS=1
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mavx -mavx2 -mfma -mf16c -msse3 -DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS: -lblis
I CC: cc (Ubuntu 12.2.0-3ubuntu1) 12.2.0
I CXX: g++ (Ubuntu 12.2.0-3ubuntu1) 12.2.0
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lblis
==== Run ./main -h for help. ====
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/perplexity/perplexity.cpp ggml.o llama.o common.o -o perplexity -lblis
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/embedding/embedding.cpp ggml.o llama.o common.o -o embedding -lblis
main: warning: model does not support context sizes greater than 2048 tokens (5377 specified); expect poor results
main: seed = 1680252440
llama_model_load: loading model from 'models/30B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 5377
llama_model_load: n_embd = 6656
llama_model_load: n_mult = 256
llama_model_load: n_head = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot = 128
llama_model_load: f16 = 3
llama_model_load: n_ff = 17920
llama_model_load: n_parts = 4
llama_model_load: type = 3
llama_model_load: ggml map size = 23269.46 MB
llama_model_load: ggml ctx size = 151.25 KB
llama_model_load: mem required = 25573.60 MB (+ 3124.00 MB per state)
llama_model_load: loading tensors from 'models/30B/ggml-model-q4_1.bin'
llama_model_load: model size = 23269.01 MB / num tensors = 543
llama_init_from_file: kv self size = 8191.52 MB
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 8, n_predict = -1, n_keep = 0
I think you need to increase the batch size to cause it to use BLAS (example invocation below).
Note that you also have to use the following when building BLIS to enable CBLAS support:
./configure auto --enable-cblas
If you get it to work, keep an eye on your total CPU% using top, as there's some weird behaviour where BLIS causes llama.cpp to sit on just 1 core at 100% independent of the -t setting. I'm currently trying to isolate it.
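For instance, something along these lines might trigger the BLAS path (model path reused from above; the prompt file and batch size are placeholders, and the recollection that ggml only hands work to BLAS for large enough matrix multiplications is an assumption):
# raise -b well above the default of 8 and feed a long prompt
./main -m models/30B/ggml-model-q4_1.bin -t 16 -b 512 -f long-prompt.txt -n 128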
I checked my BLIS config.mk and it shows mk_Enable_BLAS = yes, but the cblas option you mentioned is set to no.
Do you think I need to change that value from no to yes?
One major inconsistency is that the text generated is of different lengths with the different BLAS libs, so the total time for BLIS was lower because of an early [end of text] token. So no magic 2X performance gain.
"When it is too good to be true, it is probably not true!"
I'll see if I can get an apples-to-apples perplexity run working.
For a clean BLIS build:
blis$ ./configure auto --enable-cblas
blis$ make clean; make -j
blis$ sudo make install
If ./main is throwing symbol errors, it is because CBLAS support is not included in libblis.so.4.
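A quick way to confirm the installed library actually exports the CBLAS symbols (install path assumed to be /usr/local/lib):
nm -D /usr/local/lib/libblis.so | grep -i cblas_sgemm
# no output here suggests the build was configured without --enable-cblas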
I rebuilt BLIS with:
./configure --enable-cblas zen3
make
make install
Then I rebuilt llama.cpp with make LLAMA_OPENBLAS=1, and nothing changed...
Besides, I changed -b to 256. Still BLAS = 0; seems I need to install OpenBLAS? 😅😂
Did you change the Makefile to link against BLIS instead of OpenBLAS?
I wouldn't worry about it. There's clearly some weird threading interaction between BLIS and llama.cpp, and the performance gains disappear when the output is of the same length:
OpenBLAS:
llama_print_timings: load time = 1102.72 ms
llama_print_timings: sample time = 199.87 ms / 256 runs ( 0.78 ms per run)
llama_print_timings: prompt eval time = 539.28 ms / 7 tokens ( 77.04 ms per token)
llama_print_timings: eval time = 53608.58 ms / 255 runs ( 210.23 ms per run)
llama_print_timings: total time = 54915.92 ms
BLIS:
llama_print_timings: load time = 1106.59 ms
llama_print_timings: sample time = 201.36 ms / 256 runs ( 0.79 ms per run)
llama_print_timings: prompt eval time = 560.20 ms / 7 tokens ( 80.03 ms per token)
llama_print_timings: eval time = 53431.06 ms / 255 runs ( 209.53 ms per run)
llama_print_timings: total time = 54743.28 ms
Did you compile blis with multithreading enabled? It defaults to off. Haven't tested to see if that's the threading interaction yet, though. https://github.com/flame/blis/blob/master/docs/Multithreading.md#enabling-multithreading
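In case anyone wants to try, a rebuild sketch with threading turned on might look like this (flags as described in the BLIS docs linked above; whether it changes the interaction is untested):
./configure --enable-cblas --enable-threading=openmp auto
make -j && sudo make install
export BLIS_NUM_THREADS=16    # BLIS's own runtime thread count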
Did you change the Makefile to link against BLIS instead of OpenBLAS?
I wouldn't worry about it. There's clearly some weird threading interaction between BLIS and llama.cpp, and the performance gains are minimal at best per token:
OpenBLAS:
llama_print_timings: load time = 1102.72 ms
llama_print_timings: sample time = 199.87 ms / 256 runs ( 0.78 ms per run)
llama_print_timings: prompt eval time = 539.28 ms / 7 tokens ( 77.04 ms per token)
llama_print_timings: eval time = 53608.58 ms / 255 runs ( 210.23 ms per run)
llama_print_timings: total time = 54915.92 ms
BLIS:
llama_print_timings: load time = 1097.35 ms
llama_print_timings: sample time = 185.29 ms / 236 runs ( 0.79 ms per run)
llama_print_timings: prompt eval time = 543.52 ms / 7 tokens ( 77.65 ms per token)
llama_print_timings: eval time = 46904.29 ms / 235 runs ( 199.59 ms per run)
llama_print_timings: total time = 48190.76 ms
I think I already posted it:
-DGGML_USE_OPENBLAS -I/usr/local/include/blis
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread
I LDFLAGS: -lblis
BLIS should work, and I think it already does, since I haven't installed OpenBLAS but the speed increased.
Still, it somehow doesn't make sense.
Good idea. Tried it, but it didn't seem to change anything.
With BLIS (even though BLAS = 0 is shown):
llama_print_timings: load time = 4804.64 ms
llama_print_timings: sample time = 63.30 ms / 128 runs ( 0.49 ms per run)
llama_print_timings: prompt eval time = 2770.03 ms / 6 tokens ( 461.67 ms per token)
llama_print_timings: eval time = 77765.66 ms / 127 runs ( 612.33 ms per run)
llama_print_timings: total time = 82635.44 ms
real 1m23.066s
user 21m19.269s
sys 0m3.417s
Without BLIS:
llama_print_timings: load time = 4730.88 ms
llama_print_timings: sample time = 61.95 ms / 128 runs ( 0.48 ms per run)
llama_print_timings: prompt eval time = 2677.84 ms / 6 tokens ( 446.31 ms per token)
llama_print_timings: eval time = 78713.50 ms / 127 runs ( 619.79 ms per run)
llama_print_timings: total time = 83508.30 ms
real 1m23.950s
user 21m29.628s
sys 0m3.575s
Conclusion: weird!
I did more tests with -b 128, but the LLAMA_OPENBLAS=1 builds are still slower... even though I thought the speed had increased.
Maybe the problem is my system structure, since I use apx to manage my system apps.
@FNsi I just bypassed the whole LLAMA_OPENBLAS flag by forcing the flags into default in the makefile. Mine looks like
CFLAGS = -I. -O3 -DNDEBUG -std=c11 -fPIC -DGGML_USE_OPENBLAS -I/usr/local/include/blis
CXXFLAGS = -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
LDFLAGS = -lblis
around line 35 or so. BLAS=1 is shown when I run inference.
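If you'd rather keep the LLAMA_OPENBLAS branch instead of overriding the defaults, swapping its OpenBLAS flags for BLIS should amount to roughly this (assuming the stock Makefile adds -lopenblas and an /usr/local/include/openblas include path under that flag):
sed -i 's/-lopenblas/-lblis/' Makefile
sed -i 's|/usr/local/include/openblas|/usr/local/include/blis|' Makefile
make clean && make LLAMA_OPENBLAS=1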
I'll re-open it if people are interested in playing around with BLIS.
Similar to OpenBLAS, export BLIS_NUM_THREADS=2 seems to be ignored by llama.cpp.
I think I realized the problem I made. I just figured out that with abroot I need to change
-DGGML_USE_OPENBLAS -I/usr/local/include/blis
to
-DGGML_USE_OPENBLAS -I/.system/usr/local/include/blis
so the llama.cpp I built just bypassed BLIS even with LLAMA_OPENBLAS=1.
Thank you guys 😂
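For anyone else on an abroot/apx setup, a quick sanity check before rebuilding is to see which of the two include locations actually exists (header layout assumed to be <prefix>/include/blis):
ls -d /usr/local/include/blis /.system/usr/local/include/blis 2>/dev/null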
For what it's worth, there seems to be two BLIS repos, the AMD maintained fork at https://github.com/amd/blis, and the original at https://github.com/flame/blis which is updated far more frequently. I'm not sure if the original repo maintainers are incorporating AMD's changes but it might be worth comparing the two if someone's doing performance testing anyway.
blis$ git log | head -4
commit e3fc540b972a25f618af2e055641ad00ca51113e
Merge: 77c8f069 ea4acd26
Author: Kiran Varaganti <Kiran.Varaganti@amd.com>
Date: Sat Nov 12 13:37:42 2022 +0530
blis$ git log | head -3
commit 38fc5237520a2f20914a9de8bb14d5999009b3fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Mar 30 17:30:07 2023 -0500
llama_print_timings: load time = 1083.79 ms
llama_print_timings: sample time = 200.86 ms / 256 runs ( 0.78 ms per run)
llama_print_timings: prompt eval time = 533.84 ms / 7 tokens ( 76.26 ms per token)
llama_print_timings: eval time = 53060.28 ms / 255 runs ( 208.08 ms per run)
llama_print_timings: total time = 54349.16 ms
:man_shrugging:
@gjmulder Same threading issues too?
@omarkazmi it is nearly twice as fast when doing perplexity! Wohoo! Before it was sitting at 100% CPU, now 187% :partying_face: :partying_face: :partying_face:
EDIT: That was sarcasm. 2200+% CPU with OpenBLAS.
Funny things happened again. With BLAS, the prompt eval speed dropped.
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 256, n_predict = 512, n_keep = 0
blis or blas (blis)
llama_print_timings: load time = 4863.57 ms
llama_print_timings: sample time = 292.69 ms / 512 runs ( 0.57 ms per run)
llama_print_timings: prompt eval time = 2797.95 ms / 6 tokens ( 466.33 ms per token)
llama_print_timings: eval time = 545305.15 ms / 511 runs ( 1067.13 ms per run)
llama_print_timings: total time = 550469.28 ms
real 9m10.882s
user 142m39.586s
sys 0m5.066s
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 5377, n_batch = 256, n_predict = 512, n_keep = 0
blis or blas (blys),USA pronunciation adj., adv. not clear, sharp, distinct, or intelligible: a blurred photograph. to become blurred or unclear, as in outline, form, shape, character, etc.: His features blurred as he stepped into the fog. Hello guys, this picture is about Blis Networks (good Blix Networks Amazing Ideas #1). This post is a image/jpeg and the resolution of this attachment is 650 x 372. It's file size is only 49 KB. Wether You decided to save it to Your laptop, you have to Click here. You might also see more attachments by clicking the following image or read more at here: Blix Networks. Everybody knows that coloring is one of many most important aspects in making a layout that is beautiful room. Colour can be an essential part for decorating remodeling or generating designs, thus choosing the colors that are right have to be carefully considered. As stated in the previous post, along with may push effect on connection and emotion. Consequently, you ought to pay specific awareness in deciding on the best coloring for your household bedrooms. The sack is just a refuge where we sleep once we are tired, an area where we sleep, tired of the everyday routine, or maybe when we are sick. A place should be quiet and tranquil the most important bedroom in which we can close our doorways. Bedrooms must be vibrant as well as airy colours. Because of the importance of the big event of the room, you want to share the very best bedroom designs. We ought to select coloring and the design that may produce us realize satisfaction and luxury. Harmony wills drive in a chaotic day. By having an area with superior Blis Networks (good Blix Networks Amazing Ideas #1) colour can be a luxury by itself, you'll observe. [end of text]
llama_print_timings: load time = 4913.29 ms
llama_print_timings: sample time = 217.60 ms / 431 runs ( 0.50 ms per run)
llama_print_timings: prompt eval time = 2869.76 ms / 6 tokens ( 478.29 ms per token)
llama_print_timings: eval time = 328921.86 ms / 430 runs ( 764.93 ms per run)
llama_print_timings: total time = 334058.83 ms
real 5m34.481s
user 87m25.000s
sys 0m5.262s
@FNsi it is only six tokens. The difference in performance is likely due to the shortness of the sample.
The llama_print_timings: eval time per run looks to improve by about 25% w/BLAS.
Note that longer runs look to take progressively longer for each additional token generated, so some of the 25% gain might be due to the fact that the BLAS run generated 81 fewer tokens.
I agree. And I saw your comment about a 2000% increase? How did you manage that? I also tried building BLIS with multiple threads, but it seems to make no difference.
I have 16 AMD cores (i.e. 32 hypercores). With BLAS -t 16
I get a load average of around 22. With BLIS and long prompts or perplexity runs the load average was less than 2.0.
I assume the first 'with' you said is without?
That's a huge improvement!
Without BLAS, -t 16 has a load average of 16.
With BLAS, -t 16 has a load average of about 22.
BLAS seems to want to multithread independent of what I set OPEN_BLAS_NUM_THREADS to. BLAS therefore looks to be spawning 2 threads per llama.cpp thread, but those 32 total threads aren't running at 100% CPU, or I'd expect to see a load average close to 32.
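One way to check whether the extra threads really come from the BLAS side would be to pin the BLAS pool to a single thread and watch the load average again (variable names are the ones OpenBLAS and BLIS document; the prompt file is a placeholder):
OPENBLAS_NUM_THREADS=1 BLIS_NUM_THREADS=1 ./main -m models/30B/ggml-model-q4_1.bin -t 16 -b 512 -f long-prompt.txt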
And imagine if BLAS could actually run on all 16 threads...
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Compiling against the AMD-optimized BLIS implementation of BLAS allows me to run perplexity tests.
Current Behavior
Compiling against the AMD-optimized BLIS implementation of BLAS causes the perplexity command to process 0 chunks.
Physical (or virtual) hardware you are using, e.g. for Linux:
Operating System, e.g. for Linux:
SDK version, e.g. for Linux:
Steps to Reproduce
174 second run just calling ./main linked against OpenBLAS:
47 second run calling ./main linked against the AMD BLIS BLAS libs:
Perplexity run with BLIS doesn't process any chunks:
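For reference, the two runs boil down to roughly this (model, prompt, and dataset paths are placeholders):
# generation run, ./main linked against OpenBLAS vs. BLIS
time ./main -m models/30B/ggml-model-q4_1.bin -t 16 -n 256 -p "some prompt"
# perplexity run that processes 0 chunks when linked against BLIS
./perplexity -m models/30B/ggml-model-q4_1.bin -t 16 -f /data/llama/wikitext-2-raw/wiki.test.raw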