PygmalionAI / aphrodite-engine

Large-scale LLM inference engine
https://aphrodite.pygmalion.chat
GNU Affero General Public License v3.0

[Bug]: Using SillyTavern slows down aphrodite generation speed. #523

Open thatname opened 2 months ago

thatname commented 2 months ago

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 531.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             20
On-line CPU(s) list:                0-19
Vendor ID:                          GenuineIntel
Model name:                         13th Gen Intel(R) Core(TM) i5-13600K
CPU family:                         6
Model:                              183
Thread(s) per core:                 2
Core(s) per socket:                 10
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           6988.79
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                     VT-x
Hypervisor vendor:                  Microsoft
Virtualization type:                full
L1d cache:                          480 KiB (10 instances)
L1i cache:                          320 KiB (10 instances)
L2 cache:                           20 MiB (10 instances)
L3 cache:                           24 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] rotary-embedding-torch==0.5.3
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Aphrodite Version: 0.5.3
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled

🐛 Describe the bug

I have a 3090 and a 4090, aphrodite launched in a WSL CLI:

aphrodite run Qwen2-72B-Instruct-exl2 -tp 2 --gpu-memory-utilization 1 --kv-cache-dtype fp8 --max-context-len-to-capture 8192 --max-model-len 8192 -q exl2

First I used curl to test the server, and the generation speed averaged 14.0 tok/s. Then I tested with SillyTavern, and the generation speed dropped to 4.0 tok/s, and it stayed at 4.0 tok/s even after switching back to curl.

Log:

INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.9 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.1%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.4 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.3%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.5%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.9 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.6%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 2.9%, CPU KV cache usage: 0.0%
INFO:     Finished request cmpl-ffc35d72f97f4cc188284a294a2df830-0.
INFO:     127.0.0.1:45306 - "POST /v1/completions HTTP/1.1" 200
INFO:     127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO:     127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO:     127.0.0.1:59962 - "POST /v1/tokenize HTTP/1.1" 200
INFO:     Received request cmpl-52ca7796d07a41c29cefb1b975e87005-0: prompt: '', sampling_params:
SamplingParams(repetition_penalty=1.1, temperature=0.5, top_p=0.9, mirostat_mode=2, mirostat_tau=5.0, mirostat_eta=0.1,
stop=['\nUser:'], max_tokens=2048), lora_request: None.
INFO:     127.0.0.1:59970 - "POST /v1/completions HTTP/1.1" 200
INFO:     Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.7%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.1 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.8%, CPU KV cache usage: 0.0%
INFO:     Finished request cmpl-52ca7796d07a41c29cefb1b975e87005-0.
INFO:     Received request cmpl-df21228374bf4f84b4ede0207c2fda19-0: prompt: '', sampling_params:
SamplingParams(mirostat_mode=2, mirostat_tau=6.5, mirostat_eta=0.2, max_tokens=1024), lora_request: None.
INFO:     127.0.0.1:36398 - "POST /v1/completions HTTP/1.1" 200
INFO:     Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO:     Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 1 reqs, Swapped: 0
reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
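One way to check whether the slowdown tracks the mirostat parameters visible in the `SamplingParams` lines above, rather than the client itself, is to replay the same prompt with and without them. The sketch below is a hypothetical A/B helper (the parameter names and the `/v1/completions` endpoint are taken from the log; the helper function itself is not part of aphrodite):

```python
# Hypothetical A/B helper: build two otherwise-identical /v1/completions
# payloads, one with the mirostat parameters SillyTavern sends (as seen
# in the server log) and one without, so both can be replayed via curl.
import json

BASE = {
    "model": "Qwen2-72B-Instruct-exl2",
    "prompt": "Hello",
    "max_tokens": 128,
    "temperature": 0.5,
}

def build_payload(use_mirostat: bool) -> dict:
    payload = dict(BASE)
    if use_mirostat:
        # Values copied from the SamplingParams line in the log above.
        payload.update(mirostat_mode=2, mirostat_tau=5.0, mirostat_eta=0.1)
    return payload

if __name__ == "__main__":
    for flag in (False, True):
        print(json.dumps(build_payload(flag)))
```

POSTing each payload to the server's `/v1/completions` endpoint and comparing the `Avg generation throughput` lines in the log would show whether mirostat alone reproduces the drop from ~14 tok/s to ~4 tok/s.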
AlpinDale commented 1 month ago

That's really weird, I haven't noticed this before. @sgsdxzy any ideas? I can't think of a reason why this would even happen.

sgsdxzy commented 1 month ago

Is it related to mirostat? Now that we removed miro, it's no longer a problem...

AlpinDale commented 4 days ago

Is this still an issue?