sleepwalker2017 opened this issue 2 months ago
I am getting the same issue when trying speculative decoding (Medusa) with Vicuna: after some inference, it fails with "buffer size exceeds 2560".
I encountered an issue while using speculative decoding: '[TensorRT-LLM] [ERROR] Encountered an error in forward function: slice 501760 excesses buffer size 250880'. Version 0.9.0 dev20240222000 works normally.
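A side observation on the numbers in that error (my arithmetic, not something stated in the log): the requested slice is exactly twice the buffer capacity, which looks more like a size being doubled somewhere than a gradual overrun:

```python
# Numbers taken from the reported error message.
slice_size = 501760    # size the forward pass tried to slice
buffer_size = 250880   # allocated buffer capacity

# The requested slice is exactly 2x the buffer, suggesting a doubled
# size computation rather than an off-by-a-few overrun.
ratio = slice_size / buffer_size
assert slice_size == 2 * buffer_size
print(ratio)  # 2.0
```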
Hi, thanks for reporting this issue. I haven't been able to reproduce it on latest main on 2xA100. What --max_batch_size value did you use? It isn't specified in the build command you shared. Thanks.
I also just tested on 2xA30 and cannot reproduce using latest main, following the instructions shared above.
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse
[BENCHMARK] num_samples 100
[BENCHMARK] num_error_samples 0
[BENCHMARK] num_samples 100
[BENCHMARK] total_latency(ms) 1506.20
[BENCHMARK] seq_throughput(seq/sec) 66.39
[BENCHMARK] token_throughput(token/sec) 995.88
[BENCHMARK] avg_sequence_latency(ms) 1116.72
[BENCHMARK] max_sequence_latency(ms) 1501.60
[BENCHMARK] min_sequence_latency(ms) 872.77
[BENCHMARK] p99_sequence_latency(ms) 1501.60
[BENCHMARK] p90_sequence_latency(ms) 1501.58
[BENCHMARK] p50_sequence_latency(ms) 900.98
mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85 --enable_kv_cache_reuse --enable_chunked_context
Hi, this issue is reproduced by using --enable_kv_cache_reuse and --enable_chunked_context together.
I built the engine with --max_batch_size 24.
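For reference, a minimal sketch of what the build invocation might look like, assuming the standard examples/llama FP16 2-GPU flow; the checkpoint and output paths are placeholders, and flag names are taken from the v0.9-era trtllm-build CLI (paged context FMHA is what kv-cache reuse and chunked context rely on):

```shell
# Hypothetical build command -- paths are placeholders, not the
# reporter's actual directories.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_2gpu \
    --output_dir ./trt_engines/fp16/2-gpu \
    --gemm_plugin float16 \
    --max_batch_size 24 \
    --use_paged_context_fmha enable
```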
System Info
GPU: A30 * 2
TensorRT-LLM version: v0.9.0
Model: Vicuna 13B
Who can help?
@byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
No error message.
Actual behavior
Additional notes
No.