PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S
Nvidia driver version: 545.29.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7702 64-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 2183.5930
CPU min MHz: 1500.0000
BogoMIPS: 3992.65
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 64 MiB (128 instances)
L3 cache: 512 MiB (32 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.2.0
[pip3] torchaudio==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.1
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
🐛 Describe the bug
When running Meta-Llama-3-70B-Instruct on two L40S GPUs with an int8 bitsandbytes (bnb) quant, the following error occurs:
ERROR: Exception in callback _raise_exception_on_finish(error_callback=<bound method...7fe0cfddb850>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:25
ERROR: handle: <Handle _raise_exception_on_finish(error_callback=<bound method...7fe0cfddb850>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:25>
ERROR: Traceback (most recent call last):
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 33, in _raise_exception_on_finish
ERROR: task.result()
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 468, in run_engine_loop
ERROR: has_requests_in_progress = await asyncio.wait_for(
ERROR: File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR: return fut.result()
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 442, in engine_step
ERROR: request_outputs = await self.engine.step_async()
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 208, in step_async
ERROR: all_outputs = await self._run_workers_async(
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 293, in _run_workers_async
ERROR: all_outputs = await asyncio.gather(*coros)
ERROR: File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR: result = self.fn(*self.args, **self.kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR: return func(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/worker.py", line 235, in execute_model
ERROR: output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR: return func(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/model_runner.py", line 692, in execute_model
ERROR: hidden_states = model_executable(
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR: return self._call_impl(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR: return forward_call(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 413, in forward
ERROR: hidden_states = self.model(input_ids, positions, kv_caches,
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR: return self._call_impl(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR: return forward_call(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 340, in forward
ERROR: hidden_states, residual = layer(
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR: return self._call_impl(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR: return forward_call(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 287, in forward
ERROR: hidden_states = self.self_attn(
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR: return self._call_impl(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR: return forward_call(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 219, in forward
ERROR: qkv, _ = self.qkv_proj(hidden_states)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR: return self._call_impl(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR: return forward_call(*args, **kwargs)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/layers/linear.py", line 232, in forward
ERROR: output_parallel = self.linear_method.apply_weights(
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/layers/quantization/bitsandbytes.py", line 203, in apply_weights
ERROR: out = bnb.matmul(x, weight, bias=bias, state=state)
ERROR: File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
ERROR: return MatMul8bitLt.apply(A, B, out, bias, state)
ERROR: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
ERROR: return super().apply(*args, **kwargs) # type: ignore[misc]
ERROR: File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 321, in forward
ERROR: CA, CAt, SCA, SCAt, coo_tensorA = F.double_quant(A.to(torch.float16), threshold=state.threshold)
ERROR: RuntimeError: CUDA error: invalid device function
ERROR: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR:
ERROR:
ERROR: The above exception was the direct cause of the following exception:
ERROR:
ERROR: Traceback (most recent call last):
ERROR: File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
ERROR: self._context.run(self._callback, *self._args)
ERROR: File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 40, in _raise_exception_on_finish
ERROR: raise AsyncEngineDeadError(
ERROR: aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
The exact command used was:
python -m aphrodite.endpoints.openai.api_server --model /workspace/hub/models--NousResearch--Meta-Llama-3-70B-Instruct/snapshots/7e1b5532f5f974e32703e6fb284cd0e06563ccbb -tp 2 --load-in-8bit -gmu 0.97
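In case it helps with triage, here is a rough diagnostic sketch. It assumes (not confirmed) that "invalid device function" comes from a kernel that was not built for the GPU's compute capability. The helper names `arch_tag` and `can_run` are made up for illustration. Note that `torch.cuda.get_arch_list()` only reports what PyTorch itself was built for; the failing call is inside bitsandbytes, whose build targets are not checked here, so a passing result only rules out PyTorch.

```python
def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag such as 'sm_89'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def can_run(capability, compiled_archs):
    """Rough compatibility check (hypothetical helper, not a library API).

    Binary code (sm_XY) runs on devices with the same major version and an
    equal or higher minor version. PTX (compute_XY) can be JIT-compiled for
    any equal or newer capability.
    """
    major, minor = capability
    for arch in compiled_archs:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries such as 'sm_90a'
        a_major, a_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        archs = torch.cuda.get_arch_list()  # PyTorch's own build targets only
        for i in range(torch.cuda.device_count()):
            cap = torch.cuda.get_device_capability(i)
            print(torch.cuda.get_device_name(i), arch_tag(cap),
                  "covered by PyTorch build:", can_run(cap, archs))
    else:
        print("CUDA not available; nothing to check.")
```

Rerunning the command above with `CUDA_LAUNCH_BLOCKING=1` set, as the error message suggests, should also make the reported stack trace point at the actual failing kernel launch.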
Please advise.