PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: Problem loading EXL2 in rc_054 #561

Open nikitabalakin opened 1 month ago

nikitabalakin commented 1 month ago

Your current environment

The output of `python env.py`
Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35
Python version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 560.70
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 3900XT 12-Core Processor
CPU family:                         23
Model:                              113
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           7585.77
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip rdpid
Virtualization:                     AMD-V
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          384 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           6 MiB (12 instances)
L3 cache:                           16 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

Running any exl2 model fails with the following error, even if you download the model manually and specify the local path to it:

(AphroditeWorkerProcess pid=199758) INFO:     Using model weights format ['*.safetensors']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/kotokin/aphrodite-engine/conda/envs/aphrodite-runtime/bin/aphrodite", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/endpoints/cli.py", line 25, in main
[rank0]:     args.func(args)
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 542, in run_server
[rank0]:     engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 461, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 147, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 135, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/executor/distributed_gpu_executor.py", line 24, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/executor/executor_base.py", line 42, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 62, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 113, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/task_handler/worker.py", line 124, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 161, in load_model
[rank0]:     self.model = get_model(
[rank0]:                  ^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/modeling/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/modeling/model_loader/loader.py", line 242, in load_model
[rank0]:     self._get_weights_iterator(model_config.model,
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/modeling/model_loader/loader.py", line 217, in _get_weights_iterator
[rank0]:     hf_folder, hf_weights_files, use_safetensors = self._prepare_weights(
[rank0]:                                                    ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/kotokin/aphrodite-engine/aphrodite/modeling/model_loader/loader.py", line 207, in _prepare_weights
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Cannot find any model weights with `Kotokin/turboderp_Llama-3.1-8B-Instruct-exl2_6.0bpw`
djparente commented 1 month ago

I've been working on this same problem with a Llama-3.1 exl2 model against rc_054. I think part of the problem is that the model.safetensors.index.json file points at files that do not exist. For example, in the turboderp Llama 3.1 8B repository, model.safetensors.index.json references "model-00004-of-00004.safetensors", which doesn't exist in that exl2 repo (there is a single model shard called output.safetensors).

This problem also exists in other Llama 3.1 exl2 repositories (e.g., the two 70B repositories LoneStriker/Meta-Llama-3.1-70B-Instruct-4.0bpw-h6-exl2 and turboderp/Llama-3.1-70B-Instruct-exl2).
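A quick way to confirm the mismatch described above is to compare the shard names listed in the index against the files actually present in the model directory. This is only a minimal sketch using the standard library; the function names are hypothetical and it does not reproduce Aphrodite's own loader logic.

```python
import json
from pathlib import Path

INDEX_NAME = "model.safetensors.index.json"


def _referenced_shards(model_dir):
    """Return the set of shard filenames named in the index's weight_map."""
    with open(Path(model_dir) / INDEX_NAME) as f:
        index = json.load(f)
    # "weight_map" maps each tensor name to the shard expected to hold it.
    return set(index.get("weight_map", {}).values())


def find_missing_shards(model_dir):
    """Shards the index refers to that are not present on disk."""
    model_dir = Path(model_dir)
    referenced = _referenced_shards(model_dir)
    return sorted(n for n in referenced if not (model_dir / n).is_file())


def find_unindexed_shards(model_dir):
    """*.safetensors files on disk that the index never mentions."""
    model_dir = Path(model_dir)
    referenced = _referenced_shards(model_dir)
    return sorted(
        p.name for p in model_dir.glob("*.safetensors")
        if p.name not in referenced
    )
```

For the 8B repository mentioned above, I would expect `find_missing_shards` to list the `model-0000N-of-00004.safetensors` names and `find_unindexed_shards` to list `output.safetensors`, but I have not run it against every affected repo.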

I had Claude write me a script to introspect the safetensors file and generate a corrected model.safetensors.index.json file (see below). This gets rid of the error that it cannot find any model weights, but there are further problems:

First, it errors that "ValueError: torch.bfloat16 is not supported for quantization method exl2. Supported dtypes: [torch.float16]"

If you try to fix this by explicitly setting --dtype to float16 (I'm not sure whether this is legitimate), you run into a further error (also below). I looked at the code for a fair while and tried to debug it; I think this may be because the exl2 model quantizes lm_head, whereas the loader does not expect it to be quantized.
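The dtype error can be anticipated before launching by looking at what the checkpoint's config.json declares. The sketch below assumes the engine takes its default dtype from the `torch_dtype` field of config.json when no `--dtype` is given; that is my assumption, not something I confirmed in the Aphrodite code. The supported set is taken from the error message quoted above, and the function names are hypothetical.

```python
import json
from pathlib import Path

# Assumption: copied from the quoted error message, which lists
# torch.float16 as the only dtype supported for exl2.
EXL2_SUPPORTED_DTYPES = {"float16"}


def declared_dtype(model_dir):
    """Return the "torch_dtype" string from config.json, or None if absent."""
    with open(Path(model_dir) / "config.json") as f:
        return json.load(f).get("torch_dtype")


def needs_dtype_override(model_dir):
    """True if config.json declares a dtype outside the supported set."""
    dtype = declared_dtype(model_dir)
    return dtype is not None and dtype not in EXL2_SUPPORTED_DTYPES
```

If `needs_dtype_override` returns True, passing `--dtype float16` is the workaround tried here; whether that is numerically safe for a given model is a separate question.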

So, possible actionable fixes for aphrodite:

It is possible I do not understand the issues in enough detail to have pinned this down. Any other information you have on this topic would be much appreciated.

I much appreciate your wonderful work on this project.

[rank0]:   File "/home/dan/.venvs/aphrodite-rc054-dev/bin/aphrodite", line 8, in <module>
[rank0]:     sys.exit(main())
[rank0]:              ^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/endpoints/cli.py", line 25, in main
[rank0]:     args.func(args)
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 542, in run_server
[rank0]:     engine = AsyncAphrodite.from_engine_args(engine_args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 391, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 346, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 466, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 203, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:                           ^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 135, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/executor/distributed_gpu_executor.py", line 24, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/executor/executor_base.py", line 42, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 62, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/executor/multiproc_gpu_executor.py", line 113, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/task_handler/worker.py", line 124, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 138, in load_model
[rank0]:     self.model = get_model(
[rank0]:                  ^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/modeling/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/modeling/model_loader/loader.py", line 241, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/dan/aphrodite/rc054/aphrodite-engine/aphrodite/modeling/models/llama.py", line 430, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'lm_head.q_groups'
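
The KeyError above means the checkpoint contains a tensor named lm_head.q_groups that is missing from the model's params_dict. To see which lm_head tensors a shard actually holds without loading any weights, the header of the .safetensors file can be read directly: the file starts with an 8-byte little-endian length followed by a JSON header of that length. A minimal sketch, with hypothetical function names:

```python
import json
import struct


def read_safetensors_header(path):
    """Parse the JSON header at the start of a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def tensor_names_with_prefix(path, prefix="lm_head."):
    """List tensor names in the file that start with the given prefix."""
    header = read_safetensors_header(path)
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(
        name for name in header
        if name != "__metadata__" and name.startswith(prefix)
    )
```

If the result includes quantization tensors such as lm_head.q_groups rather than a plain lm_head.weight, that would be consistent with the guess that lm_head is quantized in the exl2 checkpoint.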

Introspection script below to rewrite the model.safetensors.index.json file from analysis of the safetensors files. (I'm not sure whether the total size is being calculated correctly.)

import json
import os
import sys

from safetensors import safe_open


def get_weights_in_file(filename):
    """Return {tensor name: {"shape", "dtype"}} for one .safetensors file."""
    weights = {}
    with safe_open(filename, framework="pt", device="cpu") as f:
        for key in f.keys():
            # Loads each tensor onto the CPU just to read its shape and dtype.
            tensor = f.get_tensor(key)
            weights[key] = {
                "shape": list(tensor.shape),
                "dtype": str(tensor.dtype)
            }
    return weights


def create_model_index(directory):
    """Build an index dict mapping every tensor name to the file holding it."""
    weight_map = {}
    metadata = {"total_size": 0}

    # Sorted so the generated index is the same on every run.
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.safetensors'):
            full_path = os.path.join(directory, filename)
            # total_size is accumulated from on-disk file sizes, which include
            # each file's header as well as the tensor data.
            file_size = os.path.getsize(full_path)
            metadata["total_size"] += file_size

            weights = get_weights_in_file(full_path)
            for weight_name in weights:
                weight_map[weight_name] = filename

    index = {
        "metadata": metadata,
        "weight_map": weight_map
    }

    return index


def write_model_index(directory, output_file="model.safetensors.index.json"):
    """Write the generated index into the model directory."""
    index = create_model_index(directory)

    # Overwrites any existing index file of the same name.
    output_path = os.path.join(directory, output_file)
    with open(output_path, 'w') as f:
        json.dump(index, f, indent=2)

    print(f"Created {output_file} with {len(index['weight_map'])} weights mapped.")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python script_name.py <directory_path>")
        sys.exit(1)

    directory = sys.argv[1]
    write_model_index(directory)
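
On the total size question: the script above sums os.path.getsize for each file, which counts the safetensors header along with the tensor data. My understanding is that index files written by the Hugging Face tooling report total_size as the sum of tensor bytes only, but treat that as an assumption. If that is the intended value, it can be computed from the data_offsets recorded in each file's header. A sketch with hypothetical function names, using only the standard library:

```python
import json
import struct
from pathlib import Path


def tensor_bytes_in_file(path):
    """Sum the byte length of every tensor recorded in a .safetensors header."""
    with open(path, "rb") as f:
        # 8-byte little-endian header length, then the JSON header itself.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional metadata entry, not a tensor
            continue
        begin, end = info["data_offsets"]
        total += end - begin
    return total


def total_tensor_bytes(directory):
    """Sum tensor bytes across all *.safetensors files in a directory."""
    return sum(
        tensor_bytes_in_file(p)
        for p in sorted(Path(directory).glob("*.safetensors"))
    )
```

The difference from the file-size sum is small (only the headers), and I don't know whether any loader actually relies on the value.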
AlpinDale commented 1 month ago

Exl2 is currently broken in the rc_054 branch. Please read the PR description at #481