Thanks for reporting! I can reproduce the issue and I believe I have a fix. It'll take a bit to finish up the other things I'm working on, get this fix in, and make a release. But if you want to use it now, you can build aphrodite from source (clone the repo and run `pip install -e .`), then apply the following patch to `vocab_parallel_embedding.py` at line 92:
```diff
index becd6f9..20db81e 100644
--- a/aphrodite/modeling/layers/vocab_parallel_embedding.py
+++ b/aphrodite/modeling/layers/vocab_parallel_embedding.py
@@ -91,16 +91,24 @@ class VocabParallelEmbedding(torch.nn.Module):
     def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         output_dim = getattr(param, "output_dim", None)
+        packed_dim = getattr(param, "packed_dim", None)
         if output_dim is not None:
-            assert loaded_weight.shape[output_dim] == self.org_vocab_size
-            loaded_weight = loaded_weight.narrow(
-                output_dim, self.vocab_start_index,
-                min(self.vocab_end_index - self.vocab_start_index,
-                    self.org_vocab_size - self.vocab_start_index))
+            shard_offset = self.vocab_start_index
+            shard_size = min(self.vocab_end_index,
+                             self.org_vocab_size) - shard_offset
+            if packed_dim == output_dim:
+                shard_size = shard_size // param.pack_factor
+                shard_offset = shard_offset // param.pack_factor
+            loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
+                                                 shard_size)
         if isinstance(param, torch.nn.parameter.UninitializedParameter):
             vocab_shape = list(loaded_weight.shape)
             if output_dim is not None:
-                vocab_shape[output_dim] = self.num_embeddings_per_partition
+                if packed_dim == output_dim:
+                    vocab_shape[
+                        output_dim] = self.num_embeddings_per_partition // param.pack_factor
+                else:
+                    vocab_shape[output_dim] = self.num_embeddings_per_partition
             param.materialize(vocab_shape, dtype=loaded_weight.dtype)
         if output_dim is not None:
             param.data.narrow(
```
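The gist of the patch: exl2, like other packed quant formats, stores several sub-byte weight values per physical element (`param.pack_factor` of them along `packed_dim`), so the old assert that the checkpoint's vocab dimension equals `org_vocab_size` can never hold for a packed `lm_head`. The fix scales the shard offset and size down by the pack factor before slicing. A minimal standalone sketch of that sharding logic, with hypothetical names rather than aphrodite's actual API:

```python
import torch

def shard_vocab_weight(loaded_weight: torch.Tensor,
                       output_dim: int,
                       vocab_start: int,
                       vocab_end: int,
                       org_vocab_size: int,
                       pack_factor: int = 1) -> torch.Tensor:
    """Slice this rank's vocab shard out of a (possibly packed) weight.

    For packed formats (e.g. eight 4-bit values per int32), the physical
    tensor is pack_factor times smaller along the packed dimension, so
    the shard offset/size must be scaled down to match.
    """
    shard_offset = vocab_start
    shard_size = min(vocab_end, org_vocab_size) - shard_offset
    if pack_factor > 1:  # stands in for packed_dim == output_dim above
        shard_offset //= pack_factor
        shard_size //= pack_factor
    return loaded_weight.narrow(output_dim, shard_offset, shard_size)

# Example: a 32000-token vocab packed 8-to-1 into int32, rank 0 of 2.
packed = torch.zeros(32000 // 8, 4096, dtype=torch.int32)
shard = shard_vocab_weight(packed, output_dim=0,
                           vocab_start=0, vocab_end=16000,
                           org_vocab_size=32000, pack_factor=8)
print(shard.shape)  # torch.Size([2000, 4096])
```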
Thanks for your quick response! I will try it out ASAP and get back to you :)
No problem. It seems to be a problem with this quant specifically, or rather this quant type; it works with a TinyLlama exl2, for example.
Thanks to this issue, I may have found a solution to the exllamav2 tensor-parallel roadblock I hit in #375.
Great, happy to help! I suppose we will be seeing some improvements there soon, then? :) Your fix worked, btw; it is running now. However, I believe I will keep using AWQ for now due to the higher tokens/s. Here are the (relatively informal) benchmarks on it:
Aphrodite bench, 4bpw OpenHermes-2.5 on an RTX 3090, Threadripper 3960X, 64 GB DDR4, Ubuntu:
48 parallel requests: [results screenshot]
single request: [results screenshot]
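For what it's worth, numbers like these can be gathered with a small harness against the OpenAI-compatible endpoint. A rough sketch; the port, the model repo name, and the reliance on the response's `usage` field are assumptions here:

```python
# Informal throughput harness: fire N concurrent completion requests
# at the server and report aggregate generated tokens per second.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:2242/v1/completions"  # assumed default port

def one_request(_: int) -> int:
    payload = {
        # hypothetical 4bpw exl2 repo name
        "model": "LoneStriker/OpenHermes-2.5-Mistral-7B-4.0bpw-h6-exl2",
        "prompt": "Once upon a time",
        "max_tokens": 256,
    }
    resp = requests.post(URL, json=payload, timeout=300)
    return resp.json()["usage"]["completion_tokens"]

def bench(parallel: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        total_tokens = sum(pool.map(one_request, range(parallel)))
    return total_tokens / (time.time() - start)

print(f"single request:       {bench(1):.1f} tok/s")
print(f"48 parallel requests: {bench(48):.1f} tok/s")
```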
GPTQ is generally faster than exl2 because it's a simpler quant format. You're also using a 5-bit quant for exl2, while the GPTQ/AWQ ones are 4-bit.
EDIT: ah wait, didn't notice you said 4bpw.
Yeah, at first I had 5bpw, but I then changed it; it wouldn't really be fair otherwise ;) Are you sure exl2 should be slower? In my experience it's about the same, and there are also some benchmarks like this one: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/
Another question: do you have an idea why GPTQ seems to be faster for single requests but slower for multiple requests? That seems pretty unintuitive.
You may read the "ExLlama v1 vs ExLlama v2 GPTQ speed (update)" section of ooba's blog:

> So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all
@SalomonKisters @AlpinDale
Getting the same error with exl2 and command-r model (turboderp/command-r-v01-35B-exl2):
```
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/vllm/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 51, in __init__
    self._init_worker()
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 86, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/worker.py", line 108, in load_model
    self.model_runner.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 134, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/loader.py", line 98, in get_model
    model.load_weights(
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 340, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'lm_head.q_groups'
```
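For context, the failing line is the Cohere loader looking up every checkpoint tensor name in `params_dict`; exl2 checkpoints ship auxiliary quantization tensors such as `lm_head.q_groups` and `lm_head.q_invperm` that the loader has no parameter entry for, hence the KeyError. A simplified sketch of the loop involved (illustrative only, not the actual fix, which teaches the loader about these tensors):

```python
import torch

def _default_loader(param: torch.nn.Parameter, w: torch.Tensor) -> None:
    # Plain copy for parameters without a custom weight_loader.
    param.data.copy_(w)

def load_weights(model: torch.nn.Module, weights_iter) -> None:
    """Mimics the failing loop: every checkpoint tensor name must
    resolve to a module parameter, or params_dict[name] raises."""
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights_iter:
        # exl2 stores extra quant metadata (q_groups, q_invperm, ...)
        # under lm_head; a model loader with no mapping for them fails
        # right here with KeyError: 'lm_head.q_groups'.
        param = params_dict[name]
        loader = getattr(param, "weight_loader", _default_loader)
        loader(param, loaded_weight)
```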
Have you tried the solution above? That worked for me.
Didn't work, as I'm not using Llama. I'm getting the same error with and without `--quantization exl2`.
This should be fixed since commit 638547ec9826fdbc061a2fc8c92e9f2e2d367481.
Your current environment
🐛 Describe the bug
To reproduce:

Using the `--quantization=exl2` flag:

```
pip3 install aphrodite-engine
python -m aphrodite.endpoints.openai.api_server --model LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2 --quantization=exl2
```

It should produce this error:

Trying it without the `--quantization=exl2` flag:

```
pip3 install aphrodite-engine
python -m aphrodite.endpoints.openai.api_server --model LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2
```

It should produce this error:
Funnily enough, this is the exact same output I got from vLLM when trying to run an exl2 model there as a test.
What I already tried: