fpgaminer / GPTQ-triton

GPTQ inference Triton kernel
Apache License 2.0

Cuda vs Triton on an RTX 3060 12GB #5

Open 1aienthusiast opened 1 year ago

1aienthusiast commented 1 year ago

CUDA: 35 tokens/s, Triton: 5 tokens/s

I used ooba's webui only for CUDA, because I've been unable to get Triton to work with ooba's webui; I made sure I used the same parameters as in the Triton command further down. This is the error I get when trying to load the Triton model in the webui:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/username/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading llama-7b-4bit-triton...
Traceback (most recent call last):
  File "/home/username/AI/2oobabooga/text-generation-webui/server.py", line 275, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 36, in _load_quant
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)
TypeError: make_quant() got an unexpected keyword argument 'faster'
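
The TypeError above is an API mismatch: ooba's GPTQ_loader.py passes a `faster` keyword that the installed make_quant() doesn't accept. A minimal, hypothetical workaround, assuming you're willing to patch GPTQ_loader.py locally (the helper name below is illustrative and not part of either repo), is to filter out keyword arguments the installed make_quant() doesn't recognize:

```python
# Hypothetical compatibility shim for GPTQ_loader.py: drop keyword arguments
# (e.g. 'faster') that the installed make_quant() does not accept.
import inspect

def call_make_quant_compat(make_quant, *args, **kwargs):
    accepted = inspect.signature(make_quant).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return make_quant(*args, **filtered)

# Usage inside _load_quant(), replacing the failing call:
# call_make_quant_compat(make_quant, model, layers, wbits, groupsize,
#                        faster=faster_kernel,
#                        kernel_switch_threshold=kernel_switch_threshold)
```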

For Triton I used this command:

python3.10 generate.py --model ./ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 1.99 --top-p 0.18 --repetition-penalty 1.15 --max-length 128

I used the 7B 4-bit model (I quantized it for Triton using python3.10 convert_weights.py --quant ~/AI/2oobabooga/text-generation-webui/models/llama-7b-4bit/llama-7b-4bit.safetensors --model ~/AI/oobabooga/text-generation-webui/models/LLaMA-7B/ --output ./).

GPU: RTX 3060 12GB
OS: Debian

fpgaminer commented 1 year ago

Thank you for the detailed bug report. I've got some optimizations in the works that should help. I'll reply again once those are done and hopefully you can test again.

fpgaminer commented 1 year ago

I just pushed a new commit which should resolve this issue. Please let me know if things are working faster for you now. I've included a new benchmark_generate.py script that is useful here. On my 3090 I'm seeing ~41 tokens/s for the vast majority of tasks on the Triton kernel.

1aienthusiast commented 1 year ago

I ran the benchmark with the command python benchmark_generate.py --model ../model/ --quant and I got:

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████| 12/12 [00:45<00:00,  3.78s/it]
Done.
Prompt length: 64, max length: 4
Generation took 0.75 seconds
Average generation speed: 5.31 tokens per second

Prompt length: 8, max length: 64
Generation took 2.43 seconds
Average generation speed: 26.35 tokens per second

Prompt length: 512, max length: 1
Generation took 0.36 seconds
Average generation speed: 2.74 tokens per second

Prompt length: 1024, max length: 256
Generation took 12.04 seconds
Average generation speed: 21.27 tokens per second

Prompt length: 2048, max length: 16
Generation took 2.40 seconds
Average generation speed: 6.67 tokens per second

Prompt length: 32, max length: 2048
Generation took 88.82 seconds
Average generation speed: 23.06 tokens per second

Prompt length: 2, max length: 1024
Generation took 41.06 seconds
Average generation speed: 24.94 tokens per second

Prompt length: 2, max length: 256
Generation took 9.74 seconds
Average generation speed: 26.30 tokens per second

Prompt length: 512, max length: 8
Generation took 0.64 seconds
Average generation speed: 12.49 tokens per second

Prompt length: 2048, max length: 1024
Generation took 55.91 seconds
Average generation speed: 18.32 tokens per second

Prompt length: 2048, max length: 2048
Generation took 117.02 seconds
Average generation speed: 17.50 tokens per second

Prompt length: 8, max length: 4
Generation took 0.15 seconds
Average generation speed: 26.59 tokens per second

Prompt length: 2048, max length: 128
Generation took 7.99 seconds
Average generation speed: 16.02 tokens per second

Prompt length: 16, max length: 128
Generation took 4.91 seconds
Average generation speed: 26.05 tokens per second

Prompt length: 1, max length: 1
Generation took 0.04 seconds
Average generation speed: 25.95 tokens per second

Prompt length: 256, max length: 512
Generation took 20.82 seconds
Average generation speed: 24.59 tokens per second

Prompt length: 256, max length: 4
Generation took 0.30 seconds
Average generation speed: 13.15 tokens per second

Prompt length: 1, max length: 2048
Generation took 88.91 seconds
Average generation speed: 23.03 tokens per second

Prompt length: 1, max length: 512
Generation took 19.85 seconds
Average generation speed: 25.79 tokens per second

Prompt length: 512, max length: 256
Generation took 10.82 seconds
Average generation speed: 23.66 tokens per second
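
For reference, the reported speed appears to be generated tokens divided by total wall time, prompt processing included (an inference from the numbers above, not something the script documents), which is why long prompts with tiny max lengths look so slow:

```python
# Assuming speed = max_length / generation_time (prompt processing included):
print(16 / 2.40)    # ~6.67 tok/s, matches "Prompt length: 2048, max length: 16"
print(256 / 12.04)  # ~21.3 tok/s, matches "Prompt length: 1024, max length: 256"
```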

Additionally I ran python3.10 generate.py --model ../model/ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 1.99 --top-p 0.18 --repetition-penalty 1.15 --max-length 512 a few times:

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:39<00:00,  3.33s/it]
Done.
Write a story about a duck: Once upon a time there was a duck. He lived in a swamp, but not this kind of.... How many ducks were hiding? Every Sunday afternoon a mysterious man would show up with all kinds and numbers... Warm Wednesday Afternoon Weather Readers today did activities that go along with each letter we are learning in their first and second grade rooms! B W is to celebrate because no books came off our libraries (that's for school librarians... Kelsey Lane Forgiveness Class Earlier in the week everyone from grade one though fifth was visited by Mrs R. Dunlop as he helped lead an important class.... Se

Generation took 5.01 seconds
Total tokens generated: 145
Average generation speed: 28.96 tokens per second

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00,  3.34s/it]
Done.
Write a story about a duck: Once upon a time there was a duck...
10 Telling Signs to Identify High Potential Teachers [INFOGRAPHIC]
by Jonathan Harnum on February 28, 2013 ,   comment

Generation took 1.94 seconds
Total tokens generated: 63
Average generation speed: 32.39 tokens per second

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:39<00:00,  3.33s/it]
Done.
Write a story about a duck: Once upon a time there was a duck and she was happy but then everything got all fuzzy because someone in their bed, said that. Then things were normal again and the main duck lady ducks and 42 chicks got married, 1-a quill for their one night stands they did before they found love...or sexually charged a pen as it is with same sex attraction...sudden urge to type and well lets face ot my 6 hours are up...

Generation took 3.89 seconds
Total tokens generated: 114
Average generation speed: 29.33 tokens per second

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00,  3.34s/it]
Done.
Write a story about a duck: Once upon a time there was a duck that could do ____________ and then, boom! Kwl." They just wrote them for an hour. It got more sophisticated from there to full written sentences but it's nice not being too critical when the person who wrote this one has been writing "Daisy wants candy," all day so she knows that word, I don’t think we had too much control over these either - from here, here is his less than inspired sentence but by golly that duck picture shows her a bunch as if there weren’t so many other options…. Maybe I need help. But oh how happy! When DK can write a proper noun at least. (Yes mom and dad came in, sorry)

Generation took 6.11 seconds
Total tokens generated: 173
Average generation speed: 28.30 tokens per second

I tried CUDA as well (using ooba's webui, this time with --no-stream enabled, and again with the same parameters as in the duck-prompt command above):

Output generated in 1.24 seconds (18.60 tokens/s, 23 tokens, context 18)
Output generated in 10.00 seconds (46.09 tokens/s, 461 tokens, context 18)
Output generated in 0.69 seconds (33.28 tokens/s, 23 tokens, context 18)
Output generated in 2.65 seconds (44.61 tokens/s, 118 tokens, context 18)
Output generated in 0.64 seconds (32.82 tokens/s, 21 tokens, context 18)
Output generated in 1.60 seconds (43.09 tokens/s, 69 tokens, context 18)
Output generated in 1.31 seconds (41.21 tokens/s, 54 tokens, context 18)
Output generated in 0.65 seconds (32.34 tokens/s, 21 tokens, context 18)
Output generated in 3.97 seconds (46.39 tokens/s, 184 tokens, context 18)
Output generated in 0.61 seconds (31.06 tokens/s, 19 tokens, context 18)
Output generated in 0.61 seconds (31.21 tokens/s, 19 tokens, context 18)
Output generated in 4.22 seconds (46.17 tokens/s, 195 tokens, context 18)
Output generated in 4.43 seconds (46.46 tokens/s, 206 tokens, context 19)

Still seems to be a bit slower than CUDA.

1aienthusiast commented 1 year ago

I re-converted the model; the results are the same:

python3.10 generate.py --model ./models/ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 1.99 --top-p 0.18 --repetition-penalty 1.15 --max-length 512
Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:39<00:00,  3.33s/it]
Done.
Write a story about a duck: Once upon a time there was a duck and this wacky space-cadet shirt. On each leg she wrote different details for that picture! My favourite of our days with no screens, in terms of quality screen use. Pip gave herself just two hours, then told them if I couldn't make more then that we were off electronics until tomorrow morning when it began again at 9am (a three hour period). Then they made me an afternoon tea of raspberries in meringue, smothered almond cake slices frozen in Nutella so you don’t get as much of the nasties like sugar - the boys loved this so are going to try baking some for their dads/mums using recipes and lots on their learning!
There has been lots too on line so check us out!! It might have something worth looking at, some posts will need pre work to understand the links and resources. It's early though luvs...so my plan for Sunday Morning...Forgive him all is forgiven to celebrate the power & hope from prayer, knowing everything works for our highest good no matter what the obstacles.. (Sermon Sundi ng )Mini playtime writing....An ABC (Art Beats Colour) Of God! Check #JA10 days into Feb; why join #just2write, on being courageous enough not yet taking anything else away??? Get things changed for free + live + survive etc… just your brain cells are occupied?!! Keep those who make social changes rather small by limit energy like it can? No we believe it works the opposite! That only just for people matters gets bigger too - in every person. No easy fix answer btw either.... If something had never begun but became that…it was very important???? Our challenge will help grow awareness. Is how best the results unfold!! Yes. Can the potential solve this though in ways too challenges more faith which we want our futures right to not miss because nobody has found any flaws on planning alone..

Generation took 16.67 seconds
Total tokens generated: 448
Average generation speed: 26.88 tokens per second

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00,  3.34s/it]
Done.
Write a story about a duck: Once upon a time there was a duck. It flew to Albuquerque and fell in the Rio Grande River on September 12, its heart beating with fury.

Generation took 1.43 seconds
Total tokens generated: 49
Average generation speed: 34.32 tokens per second

Loading model ...
Found 2 unique N values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:40<00:00,  3.34s/it]
Done.
Write a story about a duck: Once upon a time there was a duck who fell in love with Alice’s cat—yes! a very short summary or description o [M/M preferred] Idea by Fudelmae. Novo Orihisa akan selalu menjadi hero dalam Cerita Banyon novel ini dengan kepekaun gayaan yang mirip Ah Pook Is Here terlihat cinta batu melarak Keefre dengan gadai tik tali yinban lemonah awal dilacak kawan itu sendiri memili hendak makhlup merepi Sapphu luka: andu, at the disobey Lelaeg Itas sentriklahnya Ai Shinuhu apalang canda (balepan beradanya eks seperih Cik Naiji naa selagi selpatu punah), meme […]. Story has as character Dash Sanguinade from Blizzplanet; her 32cats here Blablatron and Gashvill, which uses Wrath alt Haskul and Leecoa on main, Tek'nah for example. The person I really liked first and still likes? Feud Lord, for example – just flip away when other guys' ladies turn down these moments like, this is their mistake :')". Kaitlyn thought to ask something to Nick but she could say any of idea right now, let forget his screamings lyk 'Daughter!'' and remember him more fun at home with me". Seeing Yuki kiss Youshan. Read "A Change", by Rebrand Tries or find articles, download, biologist djukkeet and youselper yoku shovoo some, they also hold much chachma zilha people take on you just rathken day take into accout how beautiful yall have been doog ya?": saturday seculityy gocbe ertyt has been little trouble because inculted albaniear to under atwo gmail ehnfib ngmbael for something newbie see fkisht by Nori i knew her do want tell know much. Stella had decided last morning herself to give what sort what vdushoch would she consider once get out the best thing ycle? at night because I’t tired most here since foreseent me only want meet people

Generation took 20.08 seconds
Total tokens generated: 529
Average generation speed: 26.34 tokens per second

fpgaminer commented 1 year ago

Thank you for the report, it's helpful.

I've been digging into this. Two things. First, the numbers reported by text-generation-webui are higher than the CUDA kernel achieves in my benchmarks, even on my 3090, so I suspect there's an apples-to-oranges comparison here: either text-generation-webui is applying optimizations of its own, or it's measuring speed oddly.

Second, on my generate benchmark the CUDA kernel does indeed perform slightly faster than Triton when the prompt is small. Quite odd, since in isolation the Triton kernel is faster; it's only slower in situ.

I rigged up per-layer timing in the model, and the results consistently show the Triton kernel being slower at the q_proj, for example, and only barely beating CUDA on the other projections. This is despite my tweaking settings and getting the Triton kernel to run twice as fast as the CUDA kernel in the isolated benchmark notebook (beating even FP16 performance).
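
A rough sketch of the kind of per-projection timing meant here, using CUDA events so the GPU work itself is measured (the module paths in the comments are just an example of where LLaMA's q_proj lives in a transformers model, not the exact harness used):

```python
import torch

def time_layer(layer, x, iters=100, warmup=10):
    """Average milliseconds per forward call of a single layer, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):          # warmup so JIT/autotune cost isn't counted
        layer(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# e.g. comparing the two kernels on the same activations:
# ms_triton = time_layer(triton_model.model.layers[0].self_attn.q_proj, x)
# ms_cuda   = time_layer(cuda_model.model.layers[0].self_attn.q_proj, x)
```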

Very odd.

I'll update once I've cracked the problem.

fpgaminer commented 1 year ago

As of my latest commit with some more optimizations, I've gotten the Triton kernel to beat CUDA in all cases on the benchmark_generate.py benchmark. This is on my 3090; I haven't tested other GPUs, though I doubt the 3060 would behave differently.

I'll take a closer look at text-generation-webui next.

1aienthusiast commented 1 year ago

For some reason it appears to be much slower than last time: the benchmark reports high tokens/s, and yet it takes a few minutes to generate each prompt:

python benchmark_generate.py --model ./ --quant
Loading model ...
Found 4 unique KN values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [01:08<00:00,  5.67s/it]
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Prompt length: 1024, max length: 16
Average generation time: 1.19 seconds
Average generation speed: 13.46 tokens per second

Prompt length: 4, max length: 512
Average generation time: 12.91 seconds
Average generation speed: 39.65 tokens per second

Prompt length: 32, max length: 1
Average generation time: 0.03 seconds
Average generation speed: 29.72 tokens per second

Prompt length: 32, max length: 8
Average generation time: 0.20 seconds
Average generation speed: 39.69 tokens per second

Prompt length: 32, max length: 512
Average generation time: 12.93 seconds
Average generation speed: 39.59 tokens per second

Prompt length: 64, max length: 8
Average generation time: 0.22 seconds
Average generation speed: 36.78 tokens per second

Prompt length: 4, max length: 4
Average generation time: 0.09 seconds
Average generation speed: 42.36 tokens per second

It's as if something is happening before it generates the prompt.

1aienthusiast commented 1 year ago

Seems like the generate.py command isn't affected by this:

./generate.py --model ./ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 0.6 --top-p 0.6 --repetition-penalty 1.1
Loading model ...
Found 4 unique KN values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████| 12/12 [01:07<00:00,  5.60s/it]
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Write a story about a duck: Once upon a time there was a duck. The duck went to the park and saw another duck. The other duck said, “I am your friend.” And then they played together.
The children will be given a sheet of paper with a picture of a duck on it. They are asked to write a story about that duck.
After writing their stories, the children are given a second sheet of paper. On this page, they are asked to draw a picture of what happened in their story.
This is an example of a completed Duck Story Sheet.
When the children have finished drawing their pictures, they are told to cut out the pictures and glue them onto a piece of construction paper.
A sample of the final product can be seen below.

Generation took 3.94 seconds
Total tokens generated: 172
Average generation speed: 43.63 tokens per second

Quite a bit faster than last time.

Maybe benchmark_generate.py warms up the autotune cache every time it generates a prompt?
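
For what it's worth, triton.autotune only re-benchmarks when the values of its `key` arguments change, so within a single process the tuning cost should be paid once per unique shape. A minimal generic sketch of that behaviour (not this repo's actual kernel):

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK': 128}, num_warps=4),
        triton.Config({'BLOCK': 256}, num_warps=8),
    ],
    key=['n'],   # re-tunes only when n changes
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n, s, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) * s, mask=mask)

x = torch.randn(4096, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK']),)
scale_kernel[grid](x, out, x.numel(), 2.0)  # first call: benchmarks the configs
scale_kernel[grid](x, out, x.numel(), 2.0)  # same n: uses the cached config
```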

1aienthusiast commented 1 year ago

another test with generate.py:

./generate.py --model ./ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 0.6 --top-p 0.6 --repetition-penalty 1.1
Loading model ...
Found 4 unique KN values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████| 12/12 [01:07<00:00,  5.64s/it]
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Write a story about a duck: Once upon a time there was a duck. The duck had no wings, so it couldn’t fly. One day the duck decided to go on an adventure and see what else is out there in the world. It flew for hours until it finally reached the ocean. The duck swam around for a while, but then it got tired and started to float back home. When it got home, it saw that its house was gone! The duck didn’t know where to go or what to do now. So it floated away again, hoping to find somewhere else to live. But it never found anywhere else to live. The duck eventually died of starvation.
This is a good example of a plot.
A plot is the series of events that make up a story.
In this story, the duck has no wings, so it can’t fly.
The duck goes on an adventure.
It gets tired and starts floating back home.
When it gets home, it sees that its house is gone!
The duck doesn’t know where to go or what to do now.
So it floats away again, hoping to find somewhere else to live.
But it never finds anywhere else to live.
The duck eventually dies of starvation.
A plot is the series of events that make up a story. In this story, the duck has no wings, so it can’t fly. The duck goes on an adventure. It gets tired and starts floating back home. When it gets home, it sees that its house is gone! The duck doesn’t know where to go or what to do now. So it floats away again, hoping to find somewhere else to live. But it never finds anywhere else to live. The duck eventually dies of starvation.

Generation took 9.49 seconds
Total tokens generated: 398
Average generation speed: 41.94 tokens per second

Never mind, my fault for not noticing the --average option. It definitely works better after your update:

python benchmark_generate.py --model ./ --quant
Loading model ...
Found 4 unique KN values.
Warming up autotune cache ...
100%|██████████████████████████████████████████████████████████| 12/12 [01:07<00:00,  5.64s/it]
Done.
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Prompt length: 2, max length: 512
Average generation time: 12.87 seconds
Average generation speed: 39.78 tokens per second

Prompt length: 64, max length: 32
Average generation time: 0.79 seconds
Average generation speed: 40.73 tokens per second

Prompt length: 32, max length: 1024
Average generation time: 27.88 seconds
Average generation speed: 36.72 tokens per second

Prompt length: 1, max length: 1024
Average generation time: 27.43 seconds
Average generation speed: 37.33 tokens per second

Prompt length: 256, max length: 4
Average generation time: 0.26 seconds
Average generation speed: 15.45 tokens per second

Prompt length: 512, max length: 8
Average generation time: 0.55 seconds
Average generation speed: 14.67 tokens per second

Prompt length: 8, max length: 1024
Average generation time: 27.23 seconds
Average generation speed: 37.60 tokens per second

Prompt length: 256, max length: 32
Average generation time: 0.97 seconds
Average generation speed: 33.15 tokens per second

Prompt length: 64, max length: 1024
Average generation time: 28.05 seconds
Average generation speed: 36.51 tokens per second

Prompt length: 1, max length: 512
Average generation time: 12.84 seconds
Average generation speed: 39.87 tokens per second

Prompt length: 2048, max length: 256
Average generation time: 11.02 seconds
Average generation speed: 23.22 tokens per second

Prompt length: 8, max length: 64
Average generation time: 1.51 seconds
Average generation speed: 42.34 tokens per second

Prompt length: 2, max length: 16
Average generation time: 0.38 seconds
Average generation speed: 42.59 tokens per second

Prompt length: 32, max length: 32
Average generation time: 0.77 seconds
Average generation speed: 41.80 tokens per second

1aienthusiast commented 1 year ago

Ooba's webui results, for comparison with generate.py:

Output generated in 3.80 seconds (46.90 tokens/s, 178 tokens, context 18)
Output generated in 5.05 seconds (48.11 tokens/s, 243 tokens, context 18)
Output generated in 7.38 seconds (47.30 tokens/s, 349 tokens, context 18)
Output generated in 3.05 seconds (46.62 tokens/s, 142 tokens, context 18)
Output generated in 3.52 seconds (46.86 tokens/s, 165 tokens, context 18)
Output generated in 2.86 seconds (46.09 tokens/s, 132 tokens, context 18)
Output generated in 10.60 seconds (46.59 tokens/s, 494 tokens, context 18)
Output generated in 11.02 seconds (46.48 tokens/s, 512 tokens, context 18)

fpgaminer commented 1 year ago

I got gptq-triton running with text-generation-webui and was able to benchmark it on my machine. Below are the numbers I'm seeing. The GPTQ-for-LLaMa numbers on my 3090 are slower than the ones you're seeing. Are you running with --xformers perhaps?

GPTQ-for-LLaMa

Output generated in 6.61 seconds (30.24 tokens/s, 200 tokens, context 32, seed 2055541194)
Output generated in 2.16 seconds (35.25 tokens/s, 76 tokens, context 32, seed 1636672766)
Output generated in 5.44 seconds (36.76 tokens/s, 200 tokens, context 32, seed 979609827)
Output generated in 2.46 seconds (35.82 tokens/s, 88 tokens, context 32, seed 1611163610)
Output generated in 5.43 seconds (36.83 tokens/s, 200 tokens, context 32, seed 60268958)
Output generated in 1.64 seconds (37.12 tokens/s, 61 tokens, context 32, seed 900847047)
Output generated in 5.14 seconds (38.94 tokens/s, 200 tokens, context 32, seed 1341066675)
Output generated in 5.24 seconds (38.14 tokens/s, 200 tokens, context 32, seed 906696475)
Output generated in 5.36 seconds (37.32 tokens/s, 200 tokens, context 32, seed 764893968)
Output generated in 4.05 seconds (39.24 tokens/s, 159 tokens, context 32, seed 252186806)
Output generated in 5.12 seconds (39.08 tokens/s, 200 tokens, context 32, seed 918576695)
Output generated in 5.15 seconds (38.87 tokens/s, 200 tokens, context 32, seed 83566943)
Output generated in 5.38 seconds (37.17 tokens/s, 200 tokens, context 32, seed 524069681)

GPTQ-triton

Output generated in 5.64 seconds (35.49 tokens/s, 200 tokens, context 32, seed 1560940026)
Output generated in 4.30 seconds (44.84 tokens/s, 193 tokens, context 32, seed 338295735)
Output generated in 2.37 seconds (45.09 tokens/s, 107 tokens, context 32, seed 139857067)
Output generated in 4.48 seconds (44.60 tokens/s, 200 tokens, context 32, seed 1058638840)
Output generated in 2.06 seconds (44.69 tokens/s, 92 tokens, context 32, seed 2070673384)
Output generated in 4.53 seconds (44.16 tokens/s, 200 tokens, context 32, seed 856013737)
Output generated in 4.49 seconds (44.59 tokens/s, 200 tokens, context 32, seed 1382018458)
Output generated in 4.46 seconds (44.83 tokens/s, 200 tokens, context 32, seed 1008794244)
Output generated in 4.55 seconds (43.97 tokens/s, 200 tokens, context 32, seed 451346737)
Output generated in 3.21 seconds (45.16 tokens/s, 145 tokens, context 32, seed 650708045)
Output generated in 4.51 seconds (44.31 tokens/s, 200 tokens, context 32, seed 1936148497)
Output generated in 4.43 seconds (45.12 tokens/s, 200 tokens, context 32, seed 1126897545)

Setup:

* 3090
* Docker using Dockerfile from text-generation-webui
* CUDA 11.7
* text-generation-webui: c4aa1a42b156b9c5ddcfb060cc497b2fba55430f
* transformers: 5506d0496957cde19318eee3d34ee682b654abe8
* GPTQ-triton: [65ae71e](https://github.com/fpgaminer/GPTQ-triton/commit/65ae71eb558fa9f42033a414b695eb2a8670e4a0)
* oobabooga/GPTQ-for-LLaMa: 57a26292ed583528d9941e79915824c5af012279
* `--model llama-7b-4bit --gptq-triton --listen --no-stream`
* `--model llama-7b-4bit-cuda --wbits 4 --listen --no-stream`

1aienthusiast commented 1 year ago

> The GPTQ-for-LLaMa numbers on my 3090 are slower than the ones you're seeing. Are you running with --xformers perhaps?

I don't use --xformers; I run the webui with:

python3.10 server.py --wbits 4 --model llama-7b-4bit --no-stream