[Bug]: - Githubissues

Your current environment

PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect 
CMake version: Could not collect 
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S

Nvidia driver version: 545.29.06
cuDNN version: Could not collect 
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             256
On-line CPU(s) list:                0-255
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7702 64-Core Processor
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 64
Socket(s):                          2
Stepping:                           0
Frequency boost:                    enabled
CPU max MHz:                        2183.5930
CPU min MHz:                        1500.0000
BogoMIPS:                           3992.65
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                     AMD-V
L1d cache:                          4 MiB (128 instances)
L1i cache:                          4 MiB (128 instances)
L2 cache:                           64 MiB (128 instances)
L3 cache:                           512 MiB (32 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-63,128-191
NUMA node1 CPU(s):                  64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.2.0
[pip3] torchaudio==2.2.0
[pip3] torchvision==0.17.0
[pip3] triton==2.2.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.1
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

When serving Meta-Llama-3-70B-Instruct on two NVIDIA L40S GPUs with an int8 bitsandbytes (bnb) quant, the following error occurs:

ERROR:    Exception in callback _raise_exception_on_finish(error_callback=<bound method...7fe0cfddb850>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:25
ERROR:    handle: <Handle _raise_exception_on_finish(error_callback=<bound method...7fe0cfddb850>>)(<Task finishe...sertions.\n')>) at /usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py:25>
ERROR:    Traceback (most recent call last):
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 33, in _raise_exception_on_finish
ERROR:        task.result()
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 468, in run_engine_loop
ERROR:        has_requests_in_progress = await asyncio.wait_for(
ERROR:      File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR:        return fut.result()
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 442, in engine_step
ERROR:        request_outputs = await self.engine.step_async()
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 208, in step_async
ERROR:        all_outputs = await self._run_workers_async(
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 293, in _run_workers_async
ERROR:        all_outputs = await asyncio.gather(*coros)
ERROR:      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR:        result = self.fn(*self.args, **self.kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/worker.py", line 235, in execute_model
ERROR:        output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR:        return func(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/task_handler/model_runner.py", line 692, in execute_model
ERROR:        hidden_states = model_executable(
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 413, in forward
ERROR:        hidden_states = self.model(input_ids, positions, kv_caches,
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 340, in forward
ERROR:        hidden_states, residual = layer(
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 287, in forward
ERROR:        hidden_states = self.self_attn(
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/models/llama.py", line 219, in forward
ERROR:        qkv, _ = self.qkv_proj(hidden_states)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
ERROR:        return self._call_impl(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
ERROR:        return forward_call(*args, **kwargs)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/layers/linear.py", line 232, in forward
ERROR:        output_parallel = self.linear_method.apply_weights(
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/modeling/layers/quantization/bitsandbytes.py", line 203, in apply_weights
ERROR:        out = bnb.matmul(x, weight, bias=bias, state=state)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
ERROR:        return MatMul8bitLt.apply(A, B, out, bias, state)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
ERROR:        return super().apply(*args, **kwargs)  # type: ignore[misc]
ERROR:      File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 321, in forward
ERROR:        CA, CAt, SCA, SCAt, coo_tensorA = F.double_quant(A.to(torch.float16), threshold=state.threshold)
ERROR:    RuntimeError: CUDA error: invalid device function
ERROR:    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR:    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR:    
ERROR:    
ERROR:    The above exception was the direct cause of the following exception:
ERROR:    
ERROR:    Traceback (most recent call last):
ERROR:      File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
ERROR:        self._context.run(self._callback, *self._args)
ERROR:      File "/usr/local/lib/python3.10/dist-packages/aphrodite/engine/async_aphrodite.py", line 40, in _raise_exception_on_finish
ERROR:        raise AsyncEngineDeadError(
ERROR:    aphrodite.engine.async_aphrodite.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
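
To make it easier to see where the failure surfaces, here is a minimal sketch that pulls the innermost frame out of the log above. The helper name `innermost_frame` and the regex are my own illustration, not part of aphrodite or bitsandbytes, and the excerpt is copied from the traceback in this report.

```python
import re

# Matches Python traceback frame lines such as:
#   File "/path/to/module.py", line 321, in forward
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+), in (\S+)')

# Excerpt copied from the log above: the last two frames and the error line.
LOG_EXCERPT = """\
ERROR:      File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 553, in apply
ERROR:        return super().apply(*args, **kwargs)  # type: ignore[misc]
ERROR:      File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 321, in forward
ERROR:        CA, CAt, SCA, SCAt, coo_tensorA = F.double_quant(A.to(torch.float16), threshold=state.threshold)
ERROR:    RuntimeError: CUDA error: invalid device function
"""


def innermost_frame(log, error_marker="RuntimeError:"):
    """Return (path, line, function) of the last frame before the error line.

    Hypothetical helper for illustration: it only looks at frames that appear
    before the first occurrence of `error_marker`.
    """
    head = log.split(error_marker, 1)[0]
    frames = FRAME_RE.findall(head)
    if not frames:
        return None
    path, line, func = frames[-1]
    return path, int(line), func


print(innermost_frame(LOG_EXCERPT))
```

For this log the innermost frame is the `forward` function in bitsandbytes/autograd/_functions.py, at the `F.double_quant` call, not aphrodite's own code. As the log itself warns, CUDA errors can be reported asynchronously, so this frame may not be the exact kernel that failed.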

The exact command used was:

python -m aphrodite.endpoints.openai.api_server --model /workspace/hub/models--NousResearch--Meta-Llama-3-70B-Instruct/snapshots/7e1b5532f5f974e32703e6fb284cd0e06563ccbb -tp 2 --load-in-8bit -gmu 0.97
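
The error message suggests passing CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of rerunning the same command that way is below. The helper name `debug_launch` is hypothetical, and the command string is the one above, unchanged.

```python
import os
import shlex

# Command from this report, unchanged.
CMD = (
    "python -m aphrodite.endpoints.openai.api_server "
    "--model /workspace/hub/models--NousResearch--Meta-Llama-3-70B-Instruct/"
    "snapshots/7e1b5532f5f974e32703e6fb284cd0e06563ccbb "
    "-tp 2 --load-in-8bit -gmu 0.97"
)


def debug_launch(cmd=CMD):
    """Return (argv, env) for rerunning the server with synchronous CUDA launches.

    Hypothetical helper: it copies the current environment and sets
    CUDA_LAUNCH_BLOCKING=1, as the error message in the log suggests.
    """
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return shlex.split(cmd), env


argv, env = debug_launch()
# To actually start the server with this environment:
#   import subprocess
#   subprocess.run(argv, env=env)
```

With synchronous launches the traceback should point at the call that actually fails, at the cost of slower execution, so this is only meant for reproducing the error.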

Please advise.

PygmalionAI / aphrodite-engine

[Bug]: #435
