Question about configurations of runtime arguments

sleepwalker2017 commented 5 months ago

I'm benchmarking vicuna 13B using trt-llm v0.9.0 on 2*A30 GPU, and try the following configurations.

I think there are some strange points:

Enabling prefix caching alone has a positive impact on performance, while other configurations or combinations generally have negative effects.
When prefix caching and chunked prefill are enabled together, an error occurs during execution, which seems to be a bug.
The highest performance is achieved with prefix caching combined with preemption, yet enabling preemption alone has negative effects, which is quite strange.

kaiyux commented 5 months ago

@sleepwalker2017 Thanks for reporting the issues. Is it possible to provide more commands and steps to reproduce the issue, especially the 2nd point?

sleepwalker2017 commented 5 months ago

branch main，commit id：66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd

model=/data/vicuna-13b/vicuna-13b-v1.5/
tp=2
python convert_checkpoint.py --model_dir ${model} \
                              --output_dir ./tllm_checkpoint_2gpu_fp16 \
                              --dtype float16 --tp_size ${tp}

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp16 \
            --output_dir ./tmp/llama/13B/trt_engines/fp16/2-gpu \
            --gemm_plugin float16 \
            --use_fused_mlp \
            --max_batch_size 24 \
            --max_input_len 2048 \
            --max_output_len 256 \
            --context_fmha enable \
            --paged_kv_cache enable \
            --use_paged_context_fmha enable \
            --remove_input_padding enable  --workers ${tp} \
            --use_fused_mlp

mpirun -n 2 --allow-run-as-root ./gptManagerBenchmark --engine_dir ../../../examples/llama/tmp/llama/13B/trt_engines/fp16/2-gpu/ --dataset ../../../benchmarks/cpp/token-norm-dist.json --kv_cache_free_gpu_mem_fraction 0.85

You can generate input tokens using your scripts locally. @kaiyux

geraldstanje commented 2 months ago

whats the flag to enable prefix caching ?

NVIDIA / TensorRT-LLM

Question about configurations of runtime arguments #1489