-
Nice work!
I am wondering whether this attention-sink magic is still needed for LLMs that have already been trained with window attention (e.g. [mistral](https://github.com/mistralai/mistral-src)). …
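For context, the attention-sink idea can be sketched as a KV-cache eviction rule: keep the first few "sink" token positions plus a sliding window of the most recent positions, evicting everything in between. This is a minimal illustration under that assumption, not the streaming-llm implementation; the function name and defaults are hypothetical.

```python
# Hypothetical sketch of attention-sink KV-cache eviction: retain the first
# n_sink positions plus the most recent `window` positions.
def sink_cache_positions(seq_len, n_sink=4, window=1020):
    """Return the KV-cache positions retained after eviction."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing to evict yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# With 8 tokens, 2 sinks, and a window of 4, positions 2 and 3 are evicted:
print(sink_cache_positions(8, n_sink=2, window=4))  # [0, 1, 4, 5, 6, 7]
```

The question above is whether a model trained with sliding-window attention from the start (like Mistral) still needs the sink positions, or whether the plain window suffices.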
-
### Is there an existing issue / discussion for this?
- [X] I have searched the existing issues / discussions
### Is this question answered in the FAQ? | Is there an existing…
-
I naively tried adding examples to https://github.com/mit-han-lab/streaming-llm/blob/main/data/mt_bench.jsonl, including examples with a length of 4k tokens, without changing anything in the script. I r…
-
Hi
https://colab.research.google.com/drive/1YtXE_JKVntkGK14Yo9thjCjPMVzhA71d?usp=sharing
Here is the Colab notebook, but it doesn't finish running in Colab; it stops after a while, apparently due to memory overload or something…
-
About int8_kv_cache, I ran some tests:
> The test model is mistral-7b.
> My test inference code is based on `run.py`, with timing statistics added around `runner.generate` and warm-up code added.
> Input…
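The warm-up-then-time pattern described above can be sketched like this. `generate` here stands in for `runner.generate` and is an assumption, not the real TensorRT-LLM API; the point is only that untimed warm-up calls come first so one-time initialization costs don't skew the average.

```python
import time

# Hedged sketch of a warm-up + timing harness, assuming `generate(prompt)`
# is a callable that runs one inference.
def timed_generate(generate, prompt, warmup=2, iters=5):
    for _ in range(warmup):      # warm-up runs, excluded from timing
        generate(prompt)
    start = time.perf_counter()
    for _ in range(iters):       # timed runs
        generate(prompt)
    return (time.perf_counter() - start) / iters  # mean seconds per call
```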
-
### Describe the bug
A few questions about installation characteristics:
1. How long does it take to install the program?
2. How much free space is required for installation?
3. On which disks is it installed?
4. In…
-
```
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
Any plans to get Metal support for us M2 users without CUDA? Thanks!
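As a workaround until then, PyTorch does expose `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, so code can fall back to the Metal (MPS) backend instead of hard-requiring CUDA. A minimal sketch of the fallback order, with the availability flags passed in as plain booleans:

```python
# Hedged sketch of device selection so CUDA isn't hard-required.
# cuda_available / mps_available would come from torch.cuda.is_available()
# and torch.backends.mps.is_available() respectively.
def pick_device(cuda_available, mps_available):
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"  # Metal Performance Shaders backend on Apple silicon
    return "cpu"

# On an M2 Mac without CUDA: pick_device(False, True) -> "mps"
```

In real code this would be used as `model.to(pick_device(torch.cuda.is_available(), torch.backends.mps.is_available()))`, assuming the model's ops are all supported on MPS.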