Closed luoyu-intel closed 2 months ago
@AidanBeltonS @OuadiElfarouki @joeatodd if there is no regression on your side, I will merge it soon.
The quantize functions currently require WARP_SIZE to equal the block size (32); lifting that restriction is remaining work.
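A minimal host-side sketch of why a quantize kernel can end up tied to WARP_SIZE = 32. It assumes (this is not confirmed in the thread) that each block of 32 values shares one scale and that the scale is derived from a butterfly max reduction across the sub-group. The names `subgroup_reduce_max` and `block_scale_is_consistent` are hypothetical; this is plain C++ emulating lanes, not the backend's kernel code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

constexpr int kBlockSize = 32;  // values sharing one scale (assumption)

// Emulates a butterfly (xor-shuffle) max reduction in which a lane can only
// exchange data with lanes inside its own sub-group of `subgroup_size` lanes.
std::array<float, kBlockSize> subgroup_reduce_max(std::array<float, kBlockSize> v,
                                                  int subgroup_size) {
    // The sub-group size must be a power of two no larger than the block.
    assert(subgroup_size > 0 && subgroup_size <= kBlockSize &&
           (subgroup_size & (subgroup_size - 1)) == 0);
    for (int mask = subgroup_size / 2; mask > 0; mask >>= 1) {
        std::array<float, kBlockSize> next = v;
        for (int lane = 0; lane < kBlockSize; ++lane) {
            // lane ^ mask never leaves the lane's sub-group because mask < subgroup_size.
            next[lane] = std::max(v[lane], v[lane ^ mask]);
        }
        v = next;
    }
    return v;
}

// True when every lane ends up holding the absolute maximum of the whole block,
// i.e. when all 32 values would be quantized with the same scale.
bool block_scale_is_consistent(const std::array<float, kBlockSize>& x, int subgroup_size) {
    std::array<float, kBlockSize> amax{};
    for (int i = 0; i < kBlockSize; ++i) {
        amax[i] = std::fabs(x[i]);
    }
    const std::array<float, kBlockSize> reduced = subgroup_reduce_max(amax, subgroup_size);
    const float expected = *std::max_element(amax.begin(), amax.end());
    return std::all_of(reduced.begin(), reduced.end(),
                       [expected](float r) { return r == expected; });
}
```

With a block whose largest value sits in the upper half, `block_scale_is_consistent(x, 32)` holds, while a sub-group size of 16 leaves lanes 0 to 15 with a different maximum than lanes 16 to 31, so an extra cross-sub-group step would be needed.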
Hi @joeatodd please review the change.
I've fixed the src1_ncols bug of mmvq. But there is a remaining accuracy bug when the prompt length is less than 9.
This PR branch:
prompt_length=9: Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside
prompt_length=8: Once upon a time, there is gepubliceerd a new version of the game Civilization IV called Civilization IV: Col
The master branch:
prompt_length=9: Once upon a time, there is a small village nestled in the mountains. Unterscheidung between the two is not always clear
prompt_length=8: Once upon a time, there isiech shut shut shut shut shutoni oni shut shut shutoni oni
The master branch is worse than this PR, and I can tell that this is not related to the sub-group size, so I will fix it in the next PR instead of this one.
I somehow missed this. Using this patch, the Gemma model is broken, at least on Q4_K_S.
@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago.
This is Llama-3 8B. Not sure what went wrong, but speed has increased.
You can check the earlier comment on the short-prompt accuracy bug (prompt length less than 9). How about a longer prompt?
Llama 8B on a longer prompt. I have an Arc A750 GPU, if that matters.
I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear when the context length exceeds some value.
@qnixsynapse The first one is confirmed as an existing issue of the master branch. I will look into the second one to see whether it is introduced by this PR.
It's a regression, since before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry that I couldn't test it earlier; I was busy testing Gemma models.
I didn't test Q4_K_S models. I will test it on A770.
Yup confirmed. Works great on CPU. Tested iQ4_XS and Q4_K_S models.
Edit: Will test on Q4_0 model (although this is a legacy quant)
Edit 2: Broken on q4_0 model as well.
Edit 3: I will test with increasing the warp size manually later to see if that fixes the issue. (I know it shouldn't but still)
PR Q4_0, warp_size=32
Once upon a time, there is a small village nestled in the rolling hills of the countryside. Unterscheidung between the two is not always clear-cut, and both terms are often used interchangeably. The village is home to a small population of people who live and work together in a close-knit community.
PR Q4_0, warp_size=16
Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside.rezzo The villagers of the village were known for their exceptional craftsmanship and artistic abilities. They were skilled in the art of woodworking, weaving, and pottery. The villagers were also
PR Q4_K_S, warp_size=32
Once upon a time, there is a small village nestled in the mountains. The villagers lived simple lives, farming the land and raising their families. But one day, a great evil descended upon the village, in the form of a powerful sorcerer.
The sorcerer was angry and resentful towards the villagers, and
PR Q4_K_S, warp_size=16
Once upon a time, there is a smalloshtztztzrtrtrtrtttt tt tt tuleuleuleuleuleuleuleuleuleuleuleule Roman Roman Roman Roman Roman Romananeaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneane@{ane
I've confirmed this bug on Q4_K_S.
It's also broken on Q4_0: the PR Q4_0, warp_size=16 output above contains a stray token ("hills of the countryside.rezzo").
Working great on iQ4_XS quant as well.
@qnixsynapse BTW, which UI are you using? It looks quite cool.
@airMeng It's chainlit.
@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE to 32 for Intel GPUs in the new PR to revert this regression. Do you agree with this?
@luoyu-intel Sure. :)
Edit: BTW, with a warp_size of 32 and the other portions of this PR, I am getting about 30 tokens/sec with iQ4_XS, up from 20 tokens/sec earlier, so please don't revert anything else. :)
Changes:
-nan. It's an issue from dpcpp: https://github.com/intel/llvm/issues/14274
Debug can output the same tokens as release in this PR (master runs into exceptions):
Release output:
Performance benefit
Intel Arc A770:
37 tokens/s to 39 tokens/s (Windows + 9600K)
38.9 tokens/s to 41.8 tokens/s (Linux + Xeon 4th)