casper-hansen opened 1 year ago
https://github.com/qwopqwop200/AutoAWQ-exllama
I succeeded in running exllama in AutoAWQ. Some minor changes to the exllama kernel were required. Performance on opt-125m:
AWQ kernel
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| wikitext | 1 | word_perplexity | 33.9570 | |
| | | byte_perplexity | 1.9333 | |
| | | bits_per_byte | 0.9510 | |
[======] Model summary: opt-125m-awq [======]
Load time: 2.66 seconds
Context speed: 10473.90 tokens/second (0.10 ms/token)
Generation speed: 118.32 tokens/second (8.45 ms/token)
VRAM: 255.58 MB
ExLlama kernel
| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| wikitext | 1 | word_perplexity | 33.9579 | |
| | | byte_perplexity | 1.9333 | |
| | | bits_per_byte | 0.9510 | |
[======] Model summary: opt-125m-awq [======]
Load time: 2.70 seconds
Context speed: 8750.52 tokens/second (0.11 ms/token)
Generation speed: 131.00 tokens/second (7.63 ms/token)
VRAM: 255.58 MB
It was tested in the following environment:
WSL (Windows 11), CUDA 11.3, PyTorch 2.0.1 + CUDA 11.7, RTX 3090 + R7 5800X
This is good work @qwopqwop200. I was working on the same thing on the exllama branch. Your initial testing suggests a modest speed boost of around 10% in generation (131.00 vs 118.32 tokens/second).
Do you want to open a PR or can I copy your work into the exllama branch?
Copy it to the exllama branch. I'm not sure yet, but it seems that the exllama and AWQ kernels use different weight storage methods. This may be why exllama is not working.
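For context, here is a minimal sketch of that layout difference (my illustration, not code from either repo), assuming AWQ packs 4-bit values along the output dimension while exllama/GPTQ-style kernels pack along the input dimension:

```python
import torch

# Illustrative layer sizes; pack_factor = 8 int4 values per int32.
in_features, out_features, bits = 4096, 11008, 4
pack_factor = 32 // bits

# AWQ-style qweight packs along the output dimension:
awq_qweight = torch.zeros(
    in_features, out_features // pack_factor, dtype=torch.int32
)  # shape (4096, 1376)

# exllama/GPTQ-style qweight packs along the input dimension:
exllama_qweight = torch.zeros(
    in_features // pack_factor, out_features, dtype=torch.int32
)  # shape (512, 11008)

# Same number of elements, but transposed packing: the exllama kernel
# cannot read AWQ-packed weights directly; they have to be unpacked and
# repacked. (The bit ordering inside each int32 likely differs as well.)
```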
I have gone through your implementation now and, unfortunately, it runs into the same issues around the shapes of in_features and out_features. I have fixed these for now in the exllama branch, but I still need to make the fused modules work.
If you have time to spare @qwopqwop200 and want to help with the exllama integration, I would appreciate it if you could work from this branch. https://github.com/casper-hansen/AutoAWQ/tree/exllama
A few issues:
- in_features == out_features
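If repacking the weights is the fix here, a conversion could look roughly like the sketch below. This is hypothetical: it ignores AWQ's intra-word interleaving and the qzeros/scales layout, which a real conversion would also have to handle.

```python
import torch

def repack_awq_to_exllama(qweight_awq: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Unpack an AWQ-layout qweight (in_features, out_features // pack)
    and repack it along the input dimension as exllama/GPTQ expects
    (in_features // pack, out_features). Sketch only: it ignores AWQ's
    intra-word interleaving, which a real conversion must undo first."""
    pack = 32 // bits
    mask = (1 << bits) - 1
    in_f, packed_out = qweight_awq.shape
    shifts = torch.arange(0, 32, bits, device=qweight_awq.device)
    # Unpack each int32 into `pack` low-bit values: (in_f, packed_out * pack).
    unpacked = ((qweight_awq.unsqueeze(-1) >> shifts) & mask).reshape(in_f, -1)
    # Regroup `pack` consecutive input rows into one output word.
    regrouped = unpacked.reshape(in_f // pack, pack, packed_out * pack)
    qweight_ex = torch.zeros(
        in_f // pack, packed_out * pack, dtype=torch.int32, device=qweight_awq.device
    )
    for i in range(pack):
        qweight_ex |= regrouped[:, i, :] << (i * bits)
    return qweight_ex
```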
Draft PR #30 is now open.
ExLlama has implemented highly optimized CUDA kernels. We should import the kernels to see just how efficient they could be in AWQ.
https://github.com/turboderp/exllama/blob/master/exllama_ext/exllama_ext.cpp#L199
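As a starting point, a wrapper around the compiled extension might look like the sketch below. It assumes the extension exposes make_q4 and q4_matmul as in the linked file; the module name and class here are illustrative, not AutoAWQ API.

```python
import torch
from exllama_ext import make_q4, q4_matmul  # module name depends on how the extension is built

# Placeholder passed when a tensor argument (e.g. g_idx) is unused.
none_tensor = torch.empty((1, 1), device="meta")

class ExllamaQuantLinear(torch.nn.Module):
    """Hypothetical 4-bit linear layer backed by exllama's q4_matmul."""

    def __init__(self, qweight, qzeros, scales, out_features):
        super().__init__()
        self.out_features = out_features
        # make_q4 returns an opaque handle to the device-side weight struct.
        self.q4 = make_q4(qweight, qzeros, scales, none_tensor, qweight.device.index)

    def forward(self, x):
        out_shape = x.shape[:-1] + (self.out_features,)
        x = x.to(torch.float16).reshape(-1, x.shape[-1])
        out = torch.empty(
            (x.shape[0], self.out_features), dtype=torch.float16, device=x.device
        )
        q4_matmul(x, self.q4, out)
        return out.reshape(out_shape)
```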