PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0
606 stars · 78 forks

[Bug]: Mixtral-8x22b-instruct not running with AWQ #421

Closed SalomonKisters closed 3 weeks ago

SalomonKisters commented 4 weeks ago

Your current environment

CPU(s):                             56
On-line CPU(s) list:                0-55
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
CPU family:                         6
Model:                              79
Thread(s) per core:                 2
Core(s) per socket:                 14
Socket(s):                          2
Stepping:                           1
CPU max MHz:                        3300.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           4800.24
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
Virtualization:                     VT-x
L1d cache:                          896 KiB (28 instances)
L1i cache:                          896 KiB (28 instances)
L2 cache:                           7 MiB (28 instances)
L3 cache:                           70 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-13,28-41
NUMA node1 CPU(s):                  14-27,42-55
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.4
[pip3] onnxruntime==1.17.3
[pip3] torch==2.2.2
[pip3] torchvision==0.17.2
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: N/A
Aphrodite Build Flags:
CUDA Archs: 6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX; ROCm: Disabled

🐛 Describe the bug

Description

Complete error message:

File "/aphrodite/task_handler/model_runner.py", line 765, in profile_run self.execute_model(seqs, kv_caches) File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/aphrodite/task_handler/model_runner.py", line 700, in execute_model hidden_states = model_executable( File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 414, in forward hidden_states = self.model(input_ids, positions, kv_caches, File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward hidden_states, residual = layer(positions, hidden_states, File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward hidden_states = self.block_sparse_moe(hidden_states) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(*args, **kwargs) File "/aphrodite/modeling/models/mixtral_quant.py", line 154, in forward batch_size, sequence_length, hidden_dim = hidden_states.shape ValueError: not enough values to unpack (expected 3, got 2) (RayWorkerAphrodite pid=37067) ERROR: Error executing method profile_num_available_blocks. This might cause (RayWorkerAphrodite pid=37067) deadlock in distributed execution. (RayWorkerAphrodite pid=37260) INFO: Model weights loaded. Memory usage: 17.20 GiB x 4 = 68.80 GiB [repeated 2x across cluster] (RayWorkerAphrodite pid=37260) ERROR: Error executing method profile_num_available_blocks. This might cause [repeated 2x across cluster] (RayWorkerAphrodite pid=37260) deadlock in distributed execution. [repeated 2x across cluster]

Thank you for your help!

SalomonKisters commented 3 weeks ago

Just got another error, this time using WizardLM-2-8x22b (a Mixtral finetune):

```
INFO: Model weights loaded. Memory usage: 17.20 GiB x 4 = 68.78 GiB
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/aphrodite/endpoints/openai/api_server.py", line 621, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/aphrodite/executor/ray_gpu_executor.py", line 71, in __init__
    self._init_cache()
  File "/aphrodite/executor/ray_gpu_executor.py", line 237, in _init_cache
    num_blocks = self._run_workers(
  File "/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/aphrodite/task_handler/worker.py", line 132, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/aphrodite/task_handler/model_runner.py", line 765, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/aphrodite/task_handler/model_runner.py", line 700, in execute_model
    hidden_states = model_executable(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/aphrodite/modeling/models/mixtral_quant.py", line 414, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/aphrodite/modeling/models/mixtral_quant.py", line 381, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/aphrodite/modeling/models/mixtral_quant.py", line 344, in forward
    hidden_states = self.block_sparse_moe(hidden_states)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/aphrodite/modeling/models/mixtral_quant.py", line 154, in forward
    batch_size, sequence_length, hidden_dim = hidden_states.shape
ValueError: not enough values to unpack (expected 3, got 2)
(RayWorkerAphrodite pid=48188) ERROR: Error executing method profile_num_available_blocks. This might cause
(RayWorkerAphrodite pid=48188) deadlock in distributed execution.
(RayWorkerAphrodite pid=48259) INFO: Model weights loaded. Memory usage: 17.20 GiB x 4 = 68.78 GiB [repeated 2x across cluster]
(RayWorkerAphrodite pid=48064) ERROR: Error executing method profile_num_available_blocks. This might cause [repeated 2x across cluster]
(RayWorkerAphrodite pid=48064) deadlock in distributed execution. [repeated 2x across cluster]
```

sgsdxzy commented 3 weeks ago

Are you using the main branch or the dev branch? Can you try the dev branch to see if the problem persists?

SalomonKisters commented 3 weeks ago

> Are you using the main branch or the dev branch? Can you try the dev branch to see if the problem persists?

Main branch, but building from source. I'll try dev and update you on the results :) It might take a while though, as the build from source takes about an hour for me for some reason.
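(Tangential note on the slow build: most of that hour usually goes into compiling the CUDA kernels for every architecture in the "CUDA Archs" list shown in the environment above. Below is a minimal sketch of the usual knobs for vLLM-derived builds; whether aphrodite-engine's setup.py honors these exact environment variables is an assumption worth checking against its docs.)

```bash
# Hedged sketch: narrow the compile targets before running the editable install.
# TORCH_CUDA_ARCH_LIST and MAX_JOBS are assumptions borrowed from vLLM-style builds.
export TORCH_CUDA_ARCH_LIST="8.6"   # placeholder: set to your GPU's compute capability only
export MAX_JOBS=8                   # cap parallel nvcc jobs so the build doesn't exhaust RAM
python3 -m pip install --no-cache-dir -e .
```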

SalomonKisters commented 3 weeks ago

> Are you using the main branch or the dev branch? Can you try the dev branch to see if the problem persists?

I tried it with dev, but a library used by aphrodite was not installed, so it failed. This is the code I am using in my Docker image:

(I tried with `git checkout dev` as well.)

```dockerfile
RUN git clone https://github.com/PygmalionAI/aphrodite-engine.git /tmp/aphrodite-engine && \
    mv /tmp/aphrodite-engine/* . && \
    rm -rf /tmp/aphrodite-engine && \
    chmod +x docker/entrypoint.sh && \
    python3 -m venv $APHRODITE_VENV && \
    /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    python3 -m pip install --no-cache-dir -e . && \
    deactivate"
```

This is the error trace after installing dev from source, when I try to start the engine:

```
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/app/aphrodite-engine/aphrodite/modeling/megatron/cupy_utils.py", line 14, in <module>
    import cupy
ModuleNotFoundError: No module named 'cupy'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 619, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 68, in __init__
    self._init_workers_ray(placement_group)
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 205, in _init_workers_ray
    self._run_workers("init_model",
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 102, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 286, in init_distributed_environment
    cupy_utils.init_process_group(
  File "/app/aphrodite-engine/aphrodite/modeling/megatron/cupy_utils.py", line 76, in init_process_group
    raise ImportError(
ImportError: NCCLBackend is not available. Please install cupy.
(RayWorkerAphrodite pid=2900) ERROR: Error executing method init_model. This might cause deadlock in
(RayWorkerAphrodite pid=2900) distributed execution.
(RayWorkerAphrodite pid=3073) ERROR: Error executing method init_model. This might cause deadlock in [repeated 2x across cluster]
(Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerAphrodite pid=3073) distributed execution. [repeated 2x across cluster]
```

SalomonKisters commented 3 weeks ago

Update: running `pip install cupy-cuda12x` in the instance seemed to fix it; I will retry later.
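(A minimal sketch of that workaround, assuming the `$APHRODITE_VENV` virtualenv layout from the Dockerfile above; the version print is just a quick check that the import the traceback complained about now succeeds.)

```bash
# Hedged sketch: install the CUDA 12.x CuPy wheel into the engine's virtualenv,
# then confirm that `import cupy` (which cupy_utils.py failed on) now works.
source "$APHRODITE_VENV/bin/activate"
pip install --no-cache-dir cupy-cuda12x
python3 -c "import cupy; print(cupy.__version__)"
```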

SalomonKisters commented 3 weeks ago

I retried with the fix, and it only changed the error:

```
WARNING: Admin key not provided. Admin operations will be disabled.
WARNING: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Using fp8_e5m2 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. But it may cause slight accuracy drop. Currently we only support fp8 without scaling factors and use e5m2 as a default format.
2024-04-25 12:08:54,673 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-04-25 12:08:54,674 WARNING utils.py:592 -- Ray currently does not support initializing Ray with fractional cpus. Your num_cpus will be truncated from 26.87999 to 26.
2024-04-25 12:08:54,846 INFO worker.py:1749 -- Started a local Ray instance.
INFO: Initializing the Aphrodite Engine (v0.5.2) with the following config:
INFO: Model = '/app/casperhansen_llama-3-8b-instruct-awq-main'
INFO: DataType = torch.float16
INFO: Model Load Format = auto
INFO: Number of GPUs = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = awq
INFO: Context Length = 8000
INFO: Enforce Eager Mode = False
INFO: KV Cache Data Type = fp8_e5m2
INFO: Device = cuda
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO: flash_attn is not found. Using xformers backend.
INFO: Model weights loaded. Memory usage: 5.34 GiB x 1 = 5.34 GiB
INFO: # GPU blocks: 61048, # CPU blocks: 4096
INFO: Minimum concurrency: 122.10x
INFO: Maximum sequence length allowed in the cache: 976768
INFO: Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
WARNING: CUDA graphs can take additional 1~3 GiB of memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode.
Capturing graph...   0% 0/35 -:--:--
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/app/aphrodite-engine/aphrodite/endpoints/openai/api_server.py", line 619, in <module>
    engine = AsyncAphrodite.from_engine_args(engine_args)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 342, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 313, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/async_aphrodite.py", line 413, in _init_engine
    return engine_class(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/engine/aphrodite_engine.py", line 111, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 71, in __init__
    self._init_cache()
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 270, in _init_cache
    self._run_workers("warm_up_model")
  File "/app/aphrodite-engine/aphrodite/executor/ray_gpu_executor.py", line 341, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/app/aphrodite-engine/aphrodite/task_handler/worker.py", line 165, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/opt/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 879, in capture_model
    graph_runner.capture(
  File "/app/aphrodite-engine/aphrodite/task_handler/model_runner.py", line 929, in capture
    self.model(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/llama.py", line 426, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/llama.py", line 351, in forward
    hidden_states, residual = layer(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/llama.py", line 298, in forward
    hidden_states = self.self_attn(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/models/llama.py", line 228, in forward
    attn_output = self.attn(
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/app/aphrodite-engine/aphrodite/modeling/layers/attention/__init__.py", line 67, in forward
    return self.backend.forward(query, key, value, key_cache, value_cache,
  File "/app/aphrodite-engine/aphrodite/modeling/layers/attention/backends/xformers.py", line 94, in forward
    PagedAttentionImpl.reshape_and_cache(key, value, key_cache,
  File "/app/aphrodite-engine/aphrodite/modeling/layers/attention/ops/paged_attn.py", line 29, in reshape_and_cache
    cache_ops.reshape_and_cache(
TypeError: reshape_and_cache(): incompatible function arguments. The following argument types are supported:

  1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: str, arg6: float) -> None

Invoked with: tensor([[[ 3.0820, 2.7754, 2.5762, ..., 0.1114, -0.8169, -0.8433], [-0.1935, -0.5439, 0.4592, ..., 1.0117, -1.0430, -0.5796], [-3.0254, 5.1680, 1.0449, ..., -0.2295, -0.6177, 1.9072], ..., [-0.3259, 0.5840, 0.2529, ..., 1.0391, -0.6431, -0.9507], [-0.2727, -1.0410, 0.4785, ..., -1.1104, 0.7637, -0.9116], [-0.1769, -0.1261, -0.3979, ..., -0.0826, -1.0986, -2.8613]],

    [[ 3.0820,  2.7754,  2.5762,  ...,  0.1114, -0.8169, -0.8433],
     [-0.1935, -0.5439,  0.4592,  ...,  1.0117, -1.0430, -0.5796],
     [-3.0254,  5.1680,  1.0449,  ..., -0.2295, -0.6177,  1.9072],
     ...,
     [-0.3259,  0.5840,  0.2529,  ...,  1.0391, -0.6431, -0.9507],
     [-0.2727, -1.0410,  0.4785,  ..., -1.1104,  0.7637, -0.9116],
     [-0.1769, -0.1261, -0.3979,  ..., -0.0826, -1.0986, -2.8613]],

    [[ 3.0820,  2.7754,  2.5762,  ...,  0.1114, -0.8169, -0.8433],
     [-0.1935, -0.5439,  0.4592,  ...,  1.0117, -1.0430, -0.5796],
     [-3.0254,  5.1680,  1.0449,  ..., -0.2295, -0.6177,  1.9072],
     ...,
     [-0.3259,  0.5840,  0.2529,  ...,  1.0391, -0.6431, -0.9507],
     [-0.2727, -1.0410,  0.4785,  ..., -1.1104,  0.7637, -0.9116],
     [-0.1769, -0.1261, -0.3979,  ..., -0.0826, -1.0986, -2.8613]],

    ...,

    [[ 3.0820,  2.7754,  2.5762,  ...,  0.1114, -0.8169, -0.8433],
     [-0.1935, -0.5439,  0.4592,  ...,  1.0117, -1.0430, -0.5796],
     [-3.0254,  5.1680,  1.0449,  ..., -0.2295, -0.6177,  1.9072],
     ...,
     [-0.3259,  0.5840,  0.2529,  ...,  1.0391, -0.6431, -0.9507],
     [-0.2727, -1.0410,  0.4785,  ..., -1.1104,  0.7637, -0.9116],
     [-0.1769, -0.1261, -0.3979,  ..., -0.0826, -1.0986, -2.8613]],

    [[ 3.0820,  2.7754,  2.5762,  ...,  0.1114, -0.8169, -0.8433],
     [-0.1935, -0.5439,  0.4592,  ...,  1.0117, -1.0430, -0.5796],
     [-3.0254,  5.1680,  1.0449,  ..., -0.2295, -0.6177,  1.9072],
     ...,
     [-0.3259,  0.5840,  0.2529,  ...,  1.0391, -0.6431, -0.9507],
     [-0.2727, -1.0410,  0.4785,  ..., -1.1104,  0.7637, -0.9116],
     [-0.1769, -0.1261, -0.3979,  ..., -0.0826, -1.0986, -2.8613]],

    [[ 3.0820,  2.7754,  2.5762,  ...,  0.1114, -0.8169, -0.8433],
     [-0.1935, -0.5439,  0.4592,  ...,  1.0117, -1.0430, -0.5796],
     [-3.0254,  5.1680,  1.0449,  ..., -0.2295, -0.6177,  1.9072],
     ...,
     [-0.3259,  0.5840,  0.2529,  ...,  1.0391, -0.6431, -0.9507],
     [-0.2727, -1.0410,  0.4785,  ..., -1.1104,  0.7637, -0.9116],
     [-0.1769, -0.1261, -0.3979,  ..., -0.0826, -1.0986, -2.8613]]],
   device='cuda:0', dtype=torch.float16), tensor([[[-0.0085, -0.0052,  0.0432,  ...,  0.0155, -0.0157,  0.0518],
```

sgsdxzy commented 3 weeks ago

Did you recompile (execute `pip install -e .`) aphrodite after switching to dev? You need to recompile it.
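(For reference, a minimal sketch of that rebuild flow, assuming the repo and virtualenv paths used in the reporter's Dockerfile above.)

```bash
# Hedged sketch: switch branches and redo the editable install so the
# compiled extensions are rebuilt against the dev sources.
cd /app/aphrodite-engine
git fetch origin && git checkout dev
source "$APHRODITE_VENV/bin/activate"
python3 -m pip install --no-cache-dir -e .
```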

SalomonKisters commented 3 weeks ago

> Did you recompile (execute `pip install -e .`) aphrodite after switching to dev? You need to recompile it.

Yeah, I am building a Docker image, and I rebuilt it entirely, also incorporating the fix I found. It also still works on main.

This is the code, using `nvidia/cuda:12.1.1-devel-ubuntu22.04` as the base:

```dockerfile
RUN git clone https://github.com/PygmalionAI/aphrodite-engine.git /app/aphrodite-engine && \
    cd /app/aphrodite-engine && \
    git checkout dev && \
    chmod +x docker/entrypoint.sh && \
    python3 -m venv $APHRODITE_VENV && \
    /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    python3 -m pip install --no-cache-dir -e . && \
    deactivate"

RUN /bin/bash -c "source $APHRODITE_VENV/bin/activate && \
    pip install cupy-cuda12x && \
    deactivate"
```

Thank you for your help

SalomonKisters commented 3 weeks ago

Found the issue in this line: `--kv-cache-dtype "fp8_e5m2" \`. That value is no longer supported and has been replaced by `fp8` on dev.
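(A sketch of the corrected launch flag, assuming the OpenAI-compatible api_server entrypoint seen in the tracebacks; the other flags are illustrative placeholders.)

```bash
# Hedged sketch: on dev, pass fp8 instead of the removed fp8_e5m2 value.
python3 -m aphrodite.endpoints.openai.api_server \
    --model /app/casperhansen_llama-3-8b-instruct-awq-main \
    --kv-cache-dtype fp8   # previously: --kv-cache-dtype "fp8_e5m2"
```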

SalomonKisters commented 3 weeks ago

Okay, I finally got it running. Thanks for the help!