Hi,
I'm testing the llama3-70b model with SmoothQuant on a node with 4 x RTX-4090 GPUs. Due to the memory restriction, I used the `host_cache_size` parameter to offload the KV cache to the host. Then I hit 2 issues:

1. From the logs, it seems this config doesn't take effect.
Log snippet:
2. In this situation, when I keep pushing inference requests, the service crashes after a little while.
Crash message:
Could you help check this? The scripts for converting, building, and serving are listed below.
Model convert:
Model build:
LLM instance:
TRT-LLM full logs:
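For reference, the host offload is configured roughly like this (a minimal sketch rather than my exact script; import paths and `KvCacheConfig` field names vary between TensorRT-LLM releases, and the engine path and cache size below are placeholders):

```python
# Minimal sketch, not the exact serving script: load a prebuilt engine with
# KV-cache host offload enabled via host_cache_size.
# NOTE: import paths differ across TensorRT-LLM releases (llmapi vs. hlapi),
# and host_cache_size is assumed to be given in bytes.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse cached KV blocks across requests
    host_cache_size=32 * 1024**3,   # placeholder: ~32 GiB of host memory for offloaded blocks
)

llm = LLM(
    model="/path/to/llama3-70b-sq-engine",  # placeholder: built engine / checkpoint dir
    tensor_parallel_size=4,                 # 4 x RTX-4090
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),  # older releases may call this max_new_tokens
)
print(outputs[0].outputs[0].text)
```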