PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: Exllama v2 not working #386

Closed SalomonKisters closed 1 month ago

SalomonKisters commented 1 month ago

Your current environment

salomon@salomon-linux-mint:~$ python env.py
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Linux Mint 21.3 (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect 
CMake version: Could not collect 
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.5.119
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen Threadripper 3960X 24-Core Processor
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        3800.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           7585.87
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                     AMD-V
L1d cache:                          768 KiB (24 instances)
L1i cache:                          768 KiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           128 MiB (8 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.17.0
[pip3] torch==2.2.0
[pip3] torchaudio==2.2.2
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.1
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

To reproduce:

Using the --quantization=exl2 flag:

  1. pip3 install aphrodite-engine

  2. python -m aphrodite.endpoints.openai.api_server --model LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2 --quantization=exl2

This produces the following error:

WARNING:  exl2 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO:     Model = 'LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = exl2
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Model is quantized. Forcing float16 datatype.
INFO:     Downloading model weights ['*.safetensors']
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 115, in __init__
    self._init_workers()
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 157, in _init_workers
    self._run_workers("load_model")
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/task_handler/worker.py", line 112, in load_model
    self.model_runner.load_model()
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 121, in load_model
    self.model = get_model(self.model_config, self.device_config,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 91, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/modeling/models/llama.py", line 473, in load_weights
    weight_loader(param, loaded_weight)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/modeling/layers/vocab_parallel_embedding.py", line 95, in weight_loader
    assert loaded_weight.shape[output_dim] == self.org_vocab_size
AssertionError
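
The same weight-loading path can presumably also be reached through the offline Python API (a sketch, assuming aphrodite exposes a vLLM-style LLM class; untested, the model name and flag mirror the command above):

from aphrodite import LLM, SamplingParams  # assumed vLLM-style offline API

# Loading the exl2 checkpoint goes through the same load_weights() code as the
# API server, so this should fail with the identical AssertionError during load:
llm = LLM(model="LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2",
          quantization="exl2")
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)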

Trying it without the --quantization=exl2 flag:

  1. pip3 install aphrodite-engine

  2. python -m aphrodite.endpoints.openai.api_server --model LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2

This produces the following error:

INFO:     Initializing the Aphrodite Engine (v0.5.1) with the following config:
INFO:     Model = 'LoneStriker/OpenHermes-2-Mistral-7B-5.0bpw-h6-exl2'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:     Downloading model weights ['*.safetensors']
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/endpoints/openai/api_server.py", line 563, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 115, in __init__
    self._init_workers()
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 157, in _init_workers
    self._run_workers("load_model")
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/engine/aphrodite_engine.py", line 1028, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/task_handler/worker.py", line 112, in load_model
    self.model_runner.load_model()
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/task_handler/model_runner.py", line 121, in load_model
    self.model = get_model(self.model_config, self.device_config,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/modeling/loader.py", line 91, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/home/salomon/.local/lib/python3.10/site-packages/aphrodite/modeling/models/llama.py", line 470, in load_weights
    param = params_dict[name]
KeyError: 'lm_head.q_groups'
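
For context, an exl2 checkpoint stores the output head as several packed tensors (the traceback shows lm_head.q_groups; exl2 checkpoints also carry companions such as lm_head.q_weight), while the unquantized Llama loader only registers a parameter named lm_head.weight. A stripped-down, hypothetical illustration of the failing lookup in load_weights:

import torch

# Hypothetical, simplified version of the loading loop in llama.py:
params_dict = {"lm_head.weight": torch.empty(32000, 4096)}  # fp16 params only

checkpoint_tensors = {
    "lm_head.q_groups": torch.empty(1),  # exl2-specific packed metadata
    "lm_head.q_weight": torch.empty(1),
}

for name, loaded_weight in checkpoint_tensors.items():
    param = params_dict[name]  # raises KeyError: 'lm_head.q_groups'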

Funnily enough, this is the exact same output I got from vLLM when I tried running an exl2 model there as a test.

What I already tried:

  1. Installing from source using a virtual environment
  2. Installing from source locally (pip install -e ...)
  3. Setting up a new virtual environment and trying it there
  4. Using the docker image
  5. Running a non-exl2 model. This worked fine each time.
AlpinDale commented 1 month ago

Thanks for reporting! I can reproduce the issue and I believe I have a fix. It'll take a bit to finish up the other stuff I'm working on, get this fix in, and make a release. But if you want to use it now, you can build aphrodite from source (clone the repo and run pip install -e .) then modify vocab_parallel_embedding.py at line 92:

index becd6f9..20db81e 100644
--- a/aphrodite/modeling/layers/vocab_parallel_embedding.py
+++ b/aphrodite/modeling/layers/vocab_parallel_embedding.py
@@ -91,16 +91,24 @@ class VocabParallelEmbedding(torch.nn.Module):

     def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
         output_dim = getattr(param, "output_dim", None)
+        packed_dim = getattr(param, "packed_dim", None)
         if output_dim is not None:
-            assert loaded_weight.shape[output_dim] == self.org_vocab_size
-            loaded_weight = loaded_weight.narrow(
-                output_dim, self.vocab_start_index,
-                min(self.vocab_end_index - self.vocab_start_index,
-                    self.org_vocab_size - self.vocab_start_index))
+            shard_offset = self.vocab_start_index
+            shard_size = min(self.vocab_end_index,
+                             self.org_vocab_size) - shard_offset
+            if packed_dim == output_dim:
+                shard_size = shard_size // param.pack_factor
+                shard_offset = shard_offset // param.pack_factor
+            loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
+                                                 shard_size)
         if isinstance(param, torch.nn.parameter.UninitializedParameter):
             vocab_shape = list(loaded_weight.shape)
             if output_dim is not None:
-                vocab_shape[output_dim] = self.num_embeddings_per_partition
+                if packed_dim == output_dim:
+                    vocab_shape[
+                        output_dim] = self.num_embeddings_per_partition // param.pack_factor
+                else:
+                    vocab_shape[output_dim] = self.num_embeddings_per_partition
             param.materialize(vocab_shape, dtype=loaded_weight.dtype)
         if output_dim is not None:
             param.data.narrow(
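
For readability, this is roughly what the weight_loader body looks like with the hunk applied (reconstructed from the diff above; the trailing copy into param.data is unchanged by the patch and cut off in the hunk):

def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
    output_dim = getattr(param, "output_dim", None)
    packed_dim = getattr(param, "packed_dim", None)
    if output_dim is not None:
        shard_offset = self.vocab_start_index
        shard_size = min(self.vocab_end_index,
                         self.org_vocab_size) - shard_offset
        if packed_dim == output_dim:
            # Packed (quantized) tensors store several logical elements per
            # physical element, so scale the shard down by the pack factor.
            shard_size = shard_size // param.pack_factor
            shard_offset = shard_offset // param.pack_factor
        loaded_weight = loaded_weight.narrow(output_dim, shard_offset,
                                             shard_size)
    if isinstance(param, torch.nn.parameter.UninitializedParameter):
        vocab_shape = list(loaded_weight.shape)
        if output_dim is not None:
            if packed_dim == output_dim:
                vocab_shape[output_dim] = (
                    self.num_embeddings_per_partition // param.pack_factor)
            else:
                vocab_shape[output_dim] = self.num_embeddings_per_partition
        param.materialize(vocab_shape, dtype=loaded_weight.dtype)
    # ... the remaining copy into param.data is unchanged by the patch.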
SalomonKisters commented 1 month ago

Thanks for your quick response! I will try it out ASAP and get back to you :)

AlpinDale commented 1 month ago

No problem. It seems to be a problem with this quant specifically, or rather this quant type; it works with a TinyLlama exl2 quant, for example.

Thanks to this issue, I may have found a solution to the exllamav2 tensor-parallel roadblock I hit in #375.

SalomonKisters commented 1 month ago

Great, happy to help! I suppose we will be seeing some improvements there soon, then? :) Your fix worked, by the way; it is running now. However, I believe I will keep using AWQ for now due to the higher tokens/s. Here are my benchmarks (relatively informal):

Aphrodite bench, 4bpw OpenHermes-2.5, on an RTX 3090, Threadripper 3960X, 64 GB DDR4, Ubuntu:

48 parallel requests: [results not captured]

single request: [results not captured]

AlpinDale commented 1 month ago

GPTQ is generally faster than exl2 because it's a simpler quant format. You're also using a 5-bit quant for exl2, while the GPTQ/AWQ ones are 4-bit.

EDIT: ah wait, I didn't notice you said 4bpw.

SalomonKisters commented 1 month ago

Yeah, at first I had 5bpw, but I then changed it; it wouldn't really be fair otherwise ;) Are you sure exl2 should be slower? In my experience it's about the same, and there are also some benchmarks like this one: https://oobabooga.github.io/blog/posts/gptq-awq-exl2-llamacpp/

Another question: do you have an idea why GPTQ seems to be faster for single requests but slower for multiple parallel requests? It seems pretty unintuitive.

sgsdxzy commented 1 month ago

You may read the "ExLlama v1 vs ExLlama v2 GPTQ speed (update)" section of ooba's blog:

So GPTQ through ExLlamav2 is actually the model with the fastest evaluation speed of all

ccdv-ai commented 1 month ago

@SalomonKisters @AlpinDale

Getting the same error with exl2 and the Command-R model (turboderp/command-r-v01-35B-exl2):

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/vllm/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 51, in __init__
    self._init_worker()
  File "/home/user/vllm/aphrodite-engine/aphrodite/executor/gpu_executor.py", line 86, in _init_worker
    self.driver_worker.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/worker.py", line 108, in load_model
    self.model_runner.load_model()
  File "/home/user/vllm/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 134, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/loader.py", line 98, in get_model
    model.load_weights(
  File "/home/user/vllm/aphrodite-engine/aphrodite/modeling/models/cohere.py", line 340, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'lm_head.q_groups'
SalomonKisters commented 1 month ago

Have you tried the solution above? That worked for me.

ccdv-ai commented 1 month ago

Have you tried the solution above? That worked for me.

It didn't work, as I'm not using Llama. I'm getting the same error with and without --quantization exl2.

sgsdxzy commented 1 month ago

This should be fixed as of commit 638547ec9826fdbc061a2fc8c92e9f2e2d367481.