PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: loading model with int8 kv cache chokes #346

Closed: BlairSadewitz closed this issue 1 week ago

BlairSadewitz commented 5 months ago

Your current environment

PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (conda-forge gcc 11.3.0-19) 11.3.0
Clang version: Could not collect 
CMake version: version 3.27.6
Libc version: glibc-2.35
Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      40 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             64
On-line CPU(s) list:                0-63
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC Processor
CPU family:                         23
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          1
Stepping:                           2
BogoMIPS:                           4890.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt nrip_save
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1 MiB (32 instances)
L1i cache:                          2 MiB (32 instances)
L2 cache:                           16 MiB (32 instances)
L3 cache:                           64 MiB (8 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.2
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$ python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8

2024-03-19 19:52:14,449 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 19:52:14,450 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 19:52:14,649 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = None
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = int8
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 599, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 676, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 341, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 410, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 113, in __init__
    self._init_workers_ray(placement_group)
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 268, in _init_workers_ray
    self.driver_worker = Worker(
                         ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
2024-03-19 19:52:19,750 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::RayWorkerAphrodite.init_worker() (pid=26429, ip=172.17.0.2, actor_id=537d7fe532ba3d411a06c1f001000000, repr=<aphrodite.engine.ray_tools.RayWorkerAphrodite object at 0x7f34058b5b50>)
  File "/root/aphrodite-engine/aphrodite/engine/ray_tools.py", line 22, in init_worker
    self.worker = worker_init_fn()
                  ^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 252, in <lambda>
    lambda rank=rank, local_rank=local_rank: Worker(
                                             ^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/worker.py", line 60, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 92, in __init__
    self.kv_quant_params = (self.load_kv_quant_params(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 116, in load_kv_quant_params
    kv_quant_params.append(kv_quant_param)
                           ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'kv_quant_param' where it is not associated with a value
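
For what it's worth, the traceback points at a classic Python pitfall: kv_quant_param is evidently only assigned inside a conditional branch of load_kv_quant_params, so when no params file is supplied (the config above prints "KV Cache Params Path = None") the later append reads an unbound name. A minimal sketch of the failing shape and a defensive variant; everything beyond the names visible in the traceback is an assumption for illustration, not the project's actual code:

from typing import List, Optional, Tuple

def load_layer_params(path: str, layer: int) -> Tuple[float, float]:
    # Hypothetical helper: read per-layer (k_scale, v_scale) from a params file.
    return (1.0, 1.0)  # stub values for illustration only

def load_kv_quant_params_buggy(path: Optional[str], n_layers: int) -> List[Tuple[float, float]]:
    # Shape of the bug: kv_quant_param is only bound when a path is given.
    kv_quant_params = []
    for i in range(n_layers):
        if path is not None:
            kv_quant_param = load_layer_params(path, i)
        kv_quant_params.append(kv_quant_param)  # UnboundLocalError when path is None
    return kv_quant_params

def load_kv_quant_params_fixed(path: Optional[str], n_layers: int) -> List[Tuple[float, float]]:
    # Defensive variant: fail early with an actionable message instead of crashing.
    if path is None:
        raise ValueError(
            "kv_cache_dtype=int8 requires quantization scales; "
            "provide the KV cache params path described in the docs.")
    return [load_layer_params(path, i) for i in range(n_layers)]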

Well, it turns out that I didn't have enough VRAM to load the model in 16-bit anyway, but I just tried it with --load-in-4bit and the failure is the same. Without the int8 KV cache, the model loads fine:

(aphrodite-runtime) root@C.10151121:~/aphrodite-engine$  python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --load-in-4bit 
WARNING:  bnb quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-03-19 20:03:18,803 WARNING utils.py:575 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
2024-03-19 20:03:18,804 WARNING utils.py:587 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 30.71999 to 30.
2024-03-19 20:03:18,984 INFO worker.py:1724 -- Started a local Ray instance.
INFO:     Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO:     Model = 'ParasiticRogue/Merged-Vicuna-RP-Stew-34B'
INFO:     DataType = torch.bfloat16
INFO:     Model Load Format = auto
INFO:     Number of GPUs = 2
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = bnb
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     KV Cache Data Type = auto
INFO:     KV Cache Params Path = None
INFO:     Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerAphrodite pid=36344) WARNING:  Custom allreduce is disabled because your platform lacks GPU P2P capability. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO:     Downloading model weights ['*.safetensors']
(RayWorkerAphrodite pid=36344) INFO:     Downloading model weights ['*.safetensors']

INFO:     Memory allocated for converted model: 9.17 GiB
INFO:     Memory reserved for converted model: 9.26 GiB
INFO:     Model weights loaded. Memory usage: 9.17 GiB x 2 = 18.34 GiB

With --kv-cache-dtype fp8_e5m2 and --load-in-4bit, it also works.
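
Presumably the difference is that fp8_e5m2 quantizes the KV cache on the fly without calibration, whereas int8 needs per-layer scales loaded from a params file, which is why the config printed "KV Cache Params Path = None" right before the crash. Assuming the flag name matches that config field (worth double-checking against the docs), a working int8 invocation would look something like:

python -m aphrodite.endpoints.openai.api_server -tp 2 --model ParasiticRogue/Merged-Vicuna-RP-Stew-34B --kv-cache-dtype int8 --kv-quant-params-path <path-to-generated-scales>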

BlairSadewitz commented 5 months ago

Oops, never mind. I didn't read the documentation. Sorry, lol. You might want to put that in boldface or something on the main page where you mention it.