Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Will NTK RoPE be supported? #482

Closed · jxtngx closed this issue 3 months ago

jxtngx commented 1 year ago

A Discord community member asked if NTK RoPE will be supported in the future, and offered the example below:

from typing import Tuple

import torch

RoPECache = Tuple[torch.Tensor, torch.Tensor]


def build_rope_cache(
    seq_len: int,
    max_seq_length: int,
    n_elem: int,
    dtype: torch.dtype,
    device: torch.device,
    base: int = 10000,
    condense_ratio: int = 1,
    scaling_factor: float = 1.0,
) -> RoPECache:
    # NTK-aware scaling: increase the base when the requested sequence length
    # exceeds the length the model was trained with
    if seq_len > max_seq_length:
        base = base * (
            (scaling_factor * seq_len / max_seq_length) - (scaling_factor - 1)
        ) ** (n_elem / (n_elem - 2))

    # inverse frequencies: theta_j = base^(-2j / n_elem)
    theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, device=device) / n_elem))

    # position indices, optionally condensed (positional interpolation)
    seq_idx = torch.arange(seq_len, device=device) / condense_ratio

    # outer product of positions and frequencies
    idx_theta = torch.outer(seq_idx, theta).repeat(1, 2)

    cos, sin = torch.cos(idx_theta), torch.sin(idx_theta)

    # return half precision for reduced-precision dtypes
    if dtype in (torch.float16, torch.bfloat16, torch.int8):
        return cos.half(), sin.half()
    return cos, sin
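
For reference, a minimal hypothetical call of the snippet above (all values are illustrative placeholders, not numbers from the thread):

# requested context (4096) exceeds the trained length (2048), so the
# NTK-scaled base is used
cos, sin = build_rope_cache(
    seq_len=4096,
    max_seq_length=2048,
    n_elem=64,
    dtype=torch.bfloat16,
    device=torch.device("cpu"),
)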
jxtngx commented 1 year ago

Associated paper is Extending Context Window of Large Language Models via Positional Interpolation

https://arxiv.org/abs/2306.15595

windprak commented 1 year ago

It is already implemented in the transformers library: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py

Here is the original Reddit post about NTK-scaled RoPE: https://www.reddit.com/comments/14lz7j5. I think it would be important to support this or linear-scaled RoPE to enable long-context training.
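
For context, a hedged sketch of how this is exposed in transformers via the rope_scaling config field ("linear" is positional interpolation, "dynamic" is NTK-aware scaling); the model name and factor below are assumptions, not values from the thread:

from transformers import AutoConfig, AutoModelForCausalLM

# enable NTK-aware ("dynamic") RoPE scaling with a 2x factor
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.rope_scaling = {"type": "dynamic", "factor": 2.0}
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", config=config)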

carmocca commented 1 year ago

Are there any "popular" checkpoints that use this technique? We'd only accept this contribution in that case.

windprak commented 1 year ago

Yes, Meta's new Code Llama uses that technique: https://about.fb.com/news/2023/08/code-llama-ai-for-coding/ (paper: https://arxiv.org/pdf/2308.12950.pdf), as does pretty much every other Llama model claiming more than 4k context, for example https://huggingface.co/upstage/Llama-2-70b-instruct. It is recommended to train with the desired NTK setting, but it can also be applied without training. As mentioned before, these techniques are implemented in the transformers library and are a very simple change.

carmocca commented 1 year ago

Can you point to where in the Code Llama paper this is mentioned? The Hub config doesn't specify anything: https://huggingface.co/codellama/CodeLlama-34b-hf/blob/main/config.json#L18, and AFAIK they only used a custom RoPE theta parameter, not the scaling technique mentioned here.

We already support changing the condensation ratio: https://github.com/Lightning-AI/lit-gpt/blob/main/lit_gpt/config.py#L51, which is what the SuperHOT blog post described.

If this technique is useful without training, then we could add support for it without any relevant checkpoint that requires it. It would be nice if you could try it out in this codebase and run some experiments to demonstrate its performance with longer context lengths.

carmocca commented 1 year ago

I'm hesitant about adding this since it's not clear to me what the expected and popular formula is. However, after #464 you can now modify the base to your liking:

config = Config.from_name("falcon-7b")
# scale using whichever formula you want, you can access the config for values
config.rope_base = ...
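
For illustration, a minimal hedged sketch of filling in that formula with the NTK expression from the snippet at the top of this issue; the sequence lengths, n_elem, and scaling factor below are hypothetical placeholders:

from lit_gpt import Config

config = Config.from_name("falcon-7b")

target_seq_len = 8192    # hypothetical target context length
trained_seq_len = 2048   # hypothetical context length the model was trained with
n_elem = 64              # hypothetical number of rotary dimensions
scaling_factor = 1.0

# NTK-aware base scaling, same expression as in the snippet above
config.rope_base = int(
    10000
    * ((scaling_factor * target_seq_len / trained_seq_len) - (scaling_factor - 1))
    ** (n_elem / (n_elem - 2))
)
config.block_size = target_seq_len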
windprak commented 1 year ago

Thanks, I thought NTK was equivalent to what Code Llama did, since they cited it. I didn't expect that they just changed that one variable. Nevertheless, I think NTK could be useful for inference with models that weren't trained for longer contexts. One more quick question, as I want to try this out now: how do I change the sequence length to match the new rope base value? With the block size? Is there anything else to look out for?

carmocca commented 1 year ago

Yes, by changing the block size. If you use the technique described in https://kaiokendev.github.io/context, you'll also need to adjust the rope_condense_ratio; see our existing configs for some examples, and the sketch below.
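
As a hedged sketch of that linear-interpolation route (the rope_condense_ratio field name is taken from the config referenced above; the 4x factor is an arbitrary example):

from lit_gpt import Config

config = Config.from_name("falcon-7b")
config.rope_condense_ratio = 4              # interpolate positions 4x (SuperHOT-style)
config.block_size = config.block_size * 4   # extend the usable context accordingly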

Also - fresh out of the oven - yet another RoPE extension technique: https://github.com/jquesnelle/yarn