-
Hi,
Thanks for the great work!
I'm trying to understand the TriForce method, but I'm confused about the middle speculation.
1. Does the target model with the retrieval-based KV cache need to verify after e…
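For context on what "verify" usually means in this setting, here is a minimal sketch of the standard draft-then-verify acceptance step from speculative decoding (illustrative only; TriForce's actual hierarchical verification may differ, and all names below are made up):

```python
import torch

def verify_draft(target_logits, draft_logits, draft_tokens):
    """Standard speculative-decoding acceptance loop (illustrative sketch,
    not TriForce's implementation). Each drafted token is accepted with
    probability min(1, p_target / p_draft); decoding stops at the first
    rejection and resamples from the residual distribution."""
    p = torch.softmax(target_logits, dim=-1)  # [k, vocab] target probs
    q = torch.softmax(draft_logits, dim=-1)   # [k, vocab] draft probs
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if torch.rand(()) < p[i, tok] / q[i, tok]:
            accepted.append(tok)
        else:
            # rejected: resample from the residual distribution max(0, p - q)
            residual = torch.clamp(p[i] - q[i], min=0)
            residual /= residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            break
    # note: if all k tokens are accepted, the full algorithm samples one
    # extra token from the target's next-position distribution (omitted here)
    return accepted
```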
-
### Motivation
For current large-model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Recently, severa…
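To make "significant portion" concrete, here is a back-of-the-envelope KV-cache size calculation (the dimensions assume a Llama-2-7B-style model; the numbers are illustrative, not from the original post):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # factor of 2 accounts for storing both keys and values
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, fp16
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB for batch 8 at 4k context
```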
-
**Environment**
CPU architecture: x86_64
CPU/Host memory size: 32 GB
GPU properties: SM86
GPU name: NVIDIA A10
GPU memory size: 24 GB
Clock frequencies used: 1695 MHz
**Libraries**
TensorRT-LLM: v…
-
Hi @Guangxuan-Xiao, do you have any comparison with sliding-window attention from Mistral? The paper only describes SWA with re-computation, which is not how it works in the newer models.
> Sliding W…
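For reference, the newer Mistral-style implementation avoids re-computation by keeping a fixed-size rolling buffer of K/V entries; a minimal sketch of the idea (my own illustration, not Mistral's actual code):

```python
import torch

class RollingKVCache:
    """Fixed-size rolling-buffer KV cache for sliding-window attention.
    New entries overwrite the oldest slot in place, so nothing is ever
    re-computed (illustrative sketch only)."""
    def __init__(self, window, n_heads, head_dim):
        self.window = window
        self.k = torch.zeros(window, n_heads, head_dim)
        self.v = torch.zeros(window, n_heads, head_dim)
        self.pos = 0  # total tokens seen so far

    def append(self, k_new, v_new):
        slot = self.pos % self.window  # ring-buffer index
        self.k[slot] = k_new
        self.v[slot] = v_new
        self.pos += 1

    def get(self):
        n = min(self.pos, self.window)
        # once full, entries are stored out of chronological order; attention
        # over the window is order-insensitive as long as positions were
        # encoded before caching (e.g. RoPE applied to keys)
        return self.k[:n], self.v[:n]
```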
-
### Anything you want to discuss about vllm.
As a beginner, I find there are too many issues and PRs, and it's hard to know where to start contributing.
Could anyone please add `good first issue` label to some is…
-
### System Info
EC2 instance: g5.48xlarge
Nvidia driver: 535.161.08
CUDA: 12.2
commit 5d8ca2faf74c494f220c8f71130340b513eea9a9
Torch: 2.3.0
### Who can help?
@byshiue running into the issue with h…
-
Congrats on the nice work.
I see StreamingLLM and InfiniteLLM are used in your experiments.
Have you developed your own implementations of `stream` and `Infinite`? The original StreamingLLM is…
-
### Feature request
Implementations:
https://github.com/mit-han-lab/streaming-llm/tree/main
https://github.com/tomaarsen/attention_sinks/tree/main
Paper:
https://arxiv.org/abs/2309.17453
A…
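For anyone skimming the links above: the core cache policy in the linked StreamingLLM / attention_sinks work keeps a few initial "sink" tokens plus a sliding window of recent tokens and evicts everything in between. A rough sketch of that policy (parameter names are mine, not the repos' actual API):

```python
def keep_indices(cache_len, n_sink=4, window=1020):
    """Return the KV-cache indices retained under the StreamingLLM policy:
    the first n_sink tokens (attention sinks) plus the most recent `window`
    tokens. Illustrative sketch only; see the linked repos for the real
    implementations."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))
    return list(range(n_sink)) + list(range(cache_len - window, cache_len))
```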
-
### Start Date
_No response_
### Implementation PR
I'm opening this issue here so that we can track progress on the long-context extension with minimal VRAM requirements. Many users h…
-
### System Info
- Host: VMware ESXi 7
- Host Nvidia drivers: 550.54.16
- VM CPU architecture: x86_64
- VM Nvidia drivers: 550.54.15
- VM OS: Ubuntu 22.04 LTS
- Physical GPU: A100
- TensorRT-LLM…