-
### System Info
Google Colab with GPU T4 and CUDA 12.2.
TensorRT-LLM version: 0.9.0.dev2024040200.
Here is the [minimum reproducible notebook](https://colab.research.google.com/drive/1xAxZKYHx_Qq4g…
-
I would like to express my gratitude for your paper and code, which have been truly enlightening for me. I conducted the experiments following the instructions provided in the README. I would be grate…
-
Hi,
Thanks for the amazing work on streaming-llm. While reading the paper, I came up with a question about why applying the "attention sink" also works for models with ALiBi position embeddings.
One o…
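For context, the eviction rule the paper describes (pin a few initial "sink" tokens, keep a recency window, drop everything in between) can be sketched roughly like this; `evict`, `n_sink`, and `window` are illustrative names, not streaming-llm's actual API:

```python
def evict(cache, n_sink=4, window=1020):
    """Attention-sink style eviction over a per-token KV cache.

    cache: list of per-token KV entries, oldest first (illustrative layout).
    Keeps the first `n_sink` entries plus the most recent `window` entries.
    """
    if len(cache) <= n_sink + window:
        return cache  # nothing to evict yet
    return cache[:n_sink] + cache[-window:]
```

The open question is whether this sink-pinning step interacts with ALiBi's relative biases the same way it does with rotary embeddings.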
-
Firstly, I'd like to express my appreciation for your insightful paper and the open-source 'streaming-llm'. Your approach and experiments are truly commendable. I hope you don't mind; I would really a…
-
Nice work!
I am wondering whether this attention sink magic is still needed for LLMs that have already been trained with window attention (e.g. [mistral](https://github.com/mistralai/mistral-src)). …
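For comparison, window attention as trained into models like Mistral corresponds to pure sliding-window eviction with no pinned initial tokens. A rough sketch (illustrative, not Mistral's actual code), which makes the contrast with the attention-sink rule concrete:

```python
def sliding_window_evict(cache, window=4096):
    """Pure sliding-window eviction: keep only the most recent `window`
    per-token KV entries; the earliest tokens are dropped entirely."""
    return cache[-window:]
```

The question then is whether models trained under this eviction pattern still benefit from additionally pinning the first few tokens at inference time.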
-
### Have you searched for similar requests?
Yes
### Is your feature request related to a problem? If so, please describe.
llama.cpp has the feature to re-use the context window if the beginning of …
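The reuse check being requested amounts to finding the longest shared token prefix between the cached context and the new prompt, so everything up to the first mismatch can be kept without recomputation. A minimal sketch (hypothetical helper, not llama.cpp's implementation):

```python
def reusable_prefix(cached_tokens, new_tokens):
    """Return the length of the shared prefix between the cached context
    and the new prompt; KV entries up to this point can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n
```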
-
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is this question answered in the FAQ? | Is there an existing…
-
I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples with a length of 4k tokens, without changing anything in the script. I r…
-
Hi
https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing
Here is the Colab notebook, but it doesn't run to completion: it stops after a while due to memory overload or something…
-
I ran some tests on int8_kv_cache:
> The test model is mistral-7b.
> My test inference code is based on `run.py`, with timing statistics added around `runner.generate` and warm-up code added.
> Input…
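For reference, the warm-up-plus-timing pattern looks roughly like this (`timed_generate` and its arguments are illustrative names, not `run.py`'s actual code):

```python
import time

def timed_generate(runner, inputs, warmup=3, iters=10):
    """Benchmark helper: run a few warm-up generations so one-time setup
    (CUDA context, kernel autotuning, allocations) is excluded, then
    average wall-clock time over `iters` timed runs."""
    for _ in range(warmup):
        runner.generate(inputs)  # discarded warm-up runs
    start = time.perf_counter()
    for _ in range(iters):
        runner.generate(inputs)
    return (time.perf_counter() - start) / iters
```

Without the warm-up runs, the first call's setup cost inflates the measured latency and makes int8 vs fp16 comparisons misleading.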