Hukongtao opened this issue 5 months ago

Can TRT-LLM support use_cache=False? Like transformers: model.generate(use_cache=False)
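For reference, this is what I mean in transformers, where `use_cache=False` disables the KV cache during generation (a minimal sketch; the checkpoint name is just an example, any causal LM works):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM would behave the same way.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-32B-Chat-GPTQ-Int4")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-32B-Chat-GPTQ-Int4")

inputs = tokenizer("Hello", return_tensors="pt")
# use_cache=False skips building the KV cache, trading speed for memory;
# with max_new_tokens=1 there is nothing to reuse from the cache anyway.
outputs = model.generate(**inputs, max_new_tokens=1, use_cache=False)
print(tokenizer.decode(outputs[0]))
```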
System Info
- CPU: x86-64
- GPU: A100
- tensorrt-llm: 0.11.0.dev2024051400
Who can help?
@ncomly-nvidia @byshiue
Information

Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Problem background:
I want to use TRT-LLM to optimize the Qwen-32B-GPTQ-4bit model, and I only generate a single output token per request. To save memory, I want to set use_cache=False, like this:
https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/tensorrt_llm/models/modeling_utils.py#L601
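To be concrete, here is a hypothetical sketch of the change I have in mind; `prepare_inputs` and its arguments are my reading of the linked file, and the exact signature in TRT-LLM may differ:

```python
# Hypothetical sketch, not a verified TRT-LLM call: trace the network
# with the KV cache disabled instead of the default use_cache=True.
inputs = model.prepare_inputs(
    max_batch_size=1,
    max_input_len=2048,
    max_seq_len=2049,   # prompt plus the single generated token
    use_cache=False,    # the flag this issue is asking about
)
model(**inputs)
```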
Then I run it.
But I got:
Expected behavior
The model builds and runs successfully with use_cache=False.
Actual behavior
An error occurs.
Additional notes
None