ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[SYCL] Fix the sub group size of Intel #8106

Closed. luoyu-intel closed this 2 months ago.

luoyu-intel commented 3 months ago

Changes:

  1. Sync the WARP_SIZE macro and replace hard-coded 32 values.
  2. Add a macro condition that changes WARP_SIZE to 16 for Intel GPUs (see the sketch after this list).
  3. Remove the hard-coded work_group_size=1024 constraint of *_norm_f32.
  4. Remove unused shared local memory.
  5. Output the correct tokens for Debug builds; previously the logits were all -nan. This is caused by a dpcpp issue: https://github.com/intel/llvm/issues/14274
  6. Move *_norm_f32 to the norm.cpp file.
  7. Fix the bug where src1_ncols was ignored in mmvq.
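
For illustration, here is a minimal sketch of the two ideas in points 2 and 3: selecting a sub-group-sized WARP_SIZE per vendor with a macro, and querying the device limit instead of hard-coding a work-group size of 1024. It is not the PR's actual code, and the GGML_SYCL_INTEL guard name is only an assumed build flag.

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstdio>

// Assumed flag name for builds targeting Intel GPUs (illustration only).
#if defined(GGML_SYCL_INTEL)
#define WARP_SIZE 16   // Intel GPUs commonly run with a sub-group size of 16
#else
#define WARP_SIZE 32   // NVIDIA warp / AMD wavefront style default
#endif

int main() {
    sycl::queue q{sycl::default_selector_v};

    // Instead of assuming every device supports work groups of 1024,
    // clamp to what the selected device actually reports.
    const size_t max_wg =
        q.get_device().get_info<sycl::info::device::max_work_group_size>();
    const size_t wg = std::min<size_t>(1024, max_wg);

    std::printf("WARP_SIZE=%d, work_group_size=%zu (device max %zu)\n",
                WARP_SIZE, wg, max_wg);
    return 0;
}
```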

With this PR, Debug builds output the same tokens as Release (master runs into exceptions):

 What's your name?ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

 Unterscheidung between the two words "name" and "namee"

Release output:

What's your name?
 Unterscheidung between the two words "name" and "namee"

Performance benefit

Intel Arc A770: from 37 tokens/s to 39 tokens/s (Windows + 9600K):

 Building a website can be done in 10 simple steps: Choosing a hosting platform before you begin anything, though, should come first.
llama_print_timings:        load time =    5109.68 ms
llama_print_timings:      sample time =       0.80 ms /    16 runs   (    0.05 ms per token, 20025.03 tokens per second)
llama_print_timings: prompt eval time =     271.43 ms /    14 tokens (   19.39 ms per token,    51.58 tokens per second)
llama_print_timings:        eval time =     383.75 ms /    15 runs   (   25.58 ms per token,    39.09 tokens per second)
llama_print_timings:       total time =     666.89 ms /    29 tokens
Log end

From 38.9 tokens/s to 41.8 tokens/s (Linux + 4th Gen Xeon):

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun in the way that little girls like you usually have. So she played outside with
llama_print_timings:        load time =   14153.51 ms
llama_print_timings:      sample time =       0.72 ms /    16 runs   (    0.04 ms per token, 22284.12 tokens per second)
llama_print_timings: prompt eval time =     391.66 ms /    33 tokens (   11.87 ms per token,    84.26 tokens per second)
llama_print_timings:        eval time =     359.29 ms /    15 runs   (   23.95 ms per token,    41.75 tokens per second)
llama_print_timings:       total time =     757.59 ms /    48 tokens
airMeng commented 3 months ago

@AidanBeltonS @OuadiElfarouki @joeatodd if there is no regression on your side, I will merge it soon.

luoyu-intel commented 3 months ago

The quantize functions require WARP_SIZE to equal the block size of 32; there is remaining work for this.
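
A minimal sketch of that constraint, for illustration only (QK4_0 is ggml's Q4_0 block size of 32 values; the static_assert is not code from this PR):

```cpp
// The 4-bit quant block holds 32 values, so kernels that map one quant
// block onto one sub-group only work when WARP_SIZE is also 32; with a
// sub-group size of 16 this assumption breaks, which is the remaining work.
#define QK4_0 32      // values per Q4_0 block, matching ggml's definition
#define WARP_SIZE 32  // sub-group size assumed by the quantize kernels

static_assert(WARP_SIZE == QK4_0,
              "quantize kernels assume WARP_SIZE == quant block size");

int main() { return 0; }
```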

luoyu-intel commented 2 months ago

Hi @joeatodd, please review the change.

luoyu-intel commented 2 months ago

I've fixed the src1_ncols bug of mmvq, but there is a remaining accuracy bug when the prompt length is less than 9.

This PR branch:

 prompt_length=9: Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside
 prompt_length=8: Once upon a time, there is gepubliceerd a new version of the game Civilization IV called Civilization IV: Col

The master branch:

 prompt_length=9: Once upon a time, there is a small village nestled in the mountains. Unterscheidung between the two is not always clear
 prompt_length=8: Once upon a time, there isiech shut shut shut shut shutoni oni shut shut shutoni oni

The master branch is worse than this PR, and I can tell this is not related to the sub-group size, so I will fix it in a follow-up PR instead of this one.

qnixsynapse commented 2 months ago

I somehow missed this. Using this patch, the Gemma model is broken, at least on Q4_K_S:

[screenshot]

luoyu-intel commented 2 months ago

@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago.

qnixsynapse commented 2 months ago

This is Llama-3 8B. Not sure what went wrong, but the speed has increased. [screenshot]

luoyu-intel commented 2 months ago

You can check my earlier comment above about the short-prompt accuracy bug. How about a longer prompt?

qnixsynapse commented 2 months ago

Llama-3 8B on a longer prompt. I have an Arc A750 GPU, if that matters.

[screenshot]

luoyu-intel commented 2 months ago

I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear once the context length exceeds some value.

luoyu-intel commented 2 months ago

@qnixsynapse The first one is confirmed as an existing issue on the master branch. I will look into the second one to see whether it was introduced by this PR.

qnixsynapse commented 2 months ago

It's a regression, since before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry that I couldn't test it earlier; I was tied up testing Gemma models.

luoyu-intel commented 2 months ago

I didn't test Q4_K_S models. I will test them on an A770.

qnixsynapse commented 2 months ago

Yup, confirmed. It works great on CPU. Tested iQ4_XS and Q4_K_S models. [screenshot]

Edit: Will test a Q4_0 model (although this is a legacy quant).

Edit 2: Broken on the Q4_0 model as well. [screenshot]

Edit 3: I will try manually increasing the warp size later to see if that fixes the issue. (I know it shouldn't, but still.)

luoyu-intel commented 2 months ago

PR Q4_0, warp_size=32
 Once upon a time, there is a small village nestled in the rolling hills of the countryside. Unterscheidung between the two is not always clear-cut, and both terms are often used interchangeably. The village is home to a small population of people who live and work together in a close-knit community.

PR Q4_0, warp_size=16
 Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside.rezzo The villagers of the village were known for their exceptional craftsmanship and artistic abilities. They were skilled in the art of woodworking, weaving, and pottery. The villagers were also

PR Q4_K_S, warp_size=32
 Once upon a time, there is a small village nestled in the mountains. The villagers lived simple lives, farming the land and raising their families. But one day, a great evil descended upon the village, in the form of a powerful sorcerer.
The sorcerer was angry and resentful towards the villagers, and

PR Q4_K_S, warp_size=16
 Once upon a time, there is a smalloshtztztzrtrtrtrtttt tt tt tuleuleuleuleuleuleuleuleuleuleuleule Roman Roman Roman Roman Roman Romananeaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneane@{ane

I've confirmed this bug on Q4_K_S.

qnixsynapse commented 2 months ago

PR Q4_0, warp_size=16 Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside.rezzo The villagers of the village were known for their exceptional craftsmanship and artistic abilities. They were skilled in the art of woodworking, weaving, and pottery. The villagers were also

hills of the countryside.rezzo

It's also broken on Q4_0.

Working great on the iQ4_XS quant as well. [screenshot]

airMeng commented 2 months ago

@qnixsynapse BTW, which UI are you using? It looks quite cool.

qnixsynapse commented 2 months ago

@airMeng It's Chainlit.

luoyu-intel commented 2 months ago

@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE back to 32 for Intel GPUs in a new PR to revert this regression. Do you agree?

qnixsynapse commented 2 months ago

@luoyu-intel Sure. :)

Edit: BTW, with warp_size=32 and the other portions of this PR, I am getting about 30 tokens/sec with iQ4_XS, whereas the earlier generation speed was 20 tokens/sec, so please don't revert anything else. :)