yiakwy-xpu-ml-framework-team closed this issue 7 months ago
Thank you for bringing up this issue!
Firstly, ChunkLlama does not utilize sparse attention; it is a full-attention model. Indeed, maintaining the locality of neighboring tokens does help the model to handle longer inputs and achieve lower PPL. For 70B models, the extended context length can even exceed 100k.
However, based on our human evaluation on real-world tasks, training-free methods still face challenges in practical long-context settings, especially on difficult questions. In my opinion, while training-free methods may achieve satisfactory results in terms of PPL, it is still premature to conclude they are fully effective given their poor performance in some areas. A recent study in this field examines the "real context length" of LLMs; it may be helpful for you!
@ChenxinAn-fdu Thanks for sharing. Hoping for more in-depth exchanges going forward.
Looking forward to further collaboration!
We know that models trained with a limited sequence length (usually 4k ~ 32k) need rotary embeddings with NTK-aware scaling in attention to go beyond it. This enables roughly 4x the training sequence length at inference.
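For concreteness, here is a minimal sketch of the NTK-aware trick: the RoPE base is enlarged so the lowest frequencies are stretched while the highest stay almost unchanged. This is not ChunkLlama's implementation; the function names and the `scale` / `dim` parameters are illustrative assumptions.

```python
import torch

def rope_inv_freq(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for head dimension `dim`."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def ntk_scaled_inv_freq(dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    """NTK-aware scaling: enlarge the RoPE base so low frequencies are
    stretched by roughly `scale` while high frequencies are nearly preserved.
    With scale ~= 4 this is the common "~4x context without fine-tuning" trick."""
    ntk_base = base * scale ** (dim / (dim - 2))
    return rope_inv_freq(dim, ntk_base)

# Example: a 4k-trained model pushed to ~16k context at inference time.
inv_freq = ntk_scaled_inv_freq(dim=128, scale=4.0)
positions = torch.arange(16384).float()
angles = torch.outer(positions, inv_freq)   # (seq_len, dim/2) rotation angles
cos, sin = angles.cos(), angles.sin()       # fed into the rotary step of attention
```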
Another fact is that, with attention windows and receptive-field arguments, many papers have already demonstrated a substantial reduction in loss and PPL for inputs of sequence length > 100k.
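As a rough illustration of the windowed-attention idea mentioned above (a generic sketch, not any specific paper's code; the window size is arbitrary):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend to key j: causal and within `window` tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())
# Each token attends only to itself and the previous window-1 tokens, so per-token
# cost stays bounded even for >100k-token inputs, while the effective receptive
# field still grows with depth as information propagates across layers.
```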
In this paper, we are presented with a 100k context extension at inference time without training support (distributed flash attention, for example), yet only zero-shot tests with 3k-length inputs and few-shot tests with 16k-length inputs are provided.
Could we then conclude that this kind of window attention supports effectively unlimited input lengths (extending to >100k tokens) without any training support, starting from a 4k/8k model?