ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Support for Phi-3 models #6849

Open criminact opened 4 months ago

criminact commented 4 months ago

Microsoft recently released Phi-3 models in 3 variants (mini, small & medium). Can we add support for this new family of models?

criminact commented 4 months ago


Model directly works 👍

GGUF link: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

Command: main -m Phi-3-mini-4k-instruct-q4.gguf -p "<|system|>\nYou are a helpful AI assistant.<|end|>\n<|user|>\nHow to explain Internet for a medieval knight?<|end|>\n<|assistant|>"

K-Mistele commented 4 months ago

Have you tested compatibility with the server? There probably needs to be a new prompt template since it's not compatible with the current ones AFAIK. Happy to dig into this in the next couple of days.

sorasoras commented 4 months ago

I believe llama.cpp does not support LongRoPE, which is used by the 128k variant.

LiuChaoXD commented 4 months ago

I believe llama.cpp does not support LongRoPE, which is used by the 128k variant.

Yeah, I tried to convert the 128K version with python convert.py .... It raises NotImplementedError: Unknown rope scaling type: longrope

MoonRide303 commented 4 months ago

Also getting NotImplementedError: Architecture 'Phi3ForCausalLM' not supported! from convert-hf-to-gguf.py.

apepkuss commented 4 months ago

@MoonRide303 Same error with convert-hf-to-gguf.py.

candre23 commented 4 months ago

Model directly works 👍

Only partially. MS is using some new RoPE technique they're calling "longrope". As-is, llama.cpp will work OK for the first few generations but will then abruptly go insane. This new longrope thing is likely the culprit.

K-Mistele commented 4 months ago

Ah yes - it looks like they published the paper in April. Details here, PDF here

Dampfinchen commented 4 months ago

This model is insane for its size.

mirek190 commented 4 months ago

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "
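
For readers skimming the thread, here is a minimal sketch (my own illustration, not code from llama.cpp; the helper name is made up) of how the Phi-3 instruct tags used in the commands in this thread compose into a single prompt string:

# Hypothetical helper illustrating the Phi-3 instruct prompt layout used above.
def phi3_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>\n{system}<|end|>\n"
        f"<|user|>\n{user}<|end|>\n"
        f"<|assistant|>"
    )

print(phi3_prompt("You are a helpful assistant.",
                  "How to explain Internet for a medieval knight?"))

The same structure is what --in-prefix, --in-suffix and -p stitch together in the interactive command above.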

MoonRide303 commented 4 months ago

I have to admit it seems to be pretty smart, even that smallest 3.8B - looks worthy of fully supporting it. Many 7B and bigger models struggled with those simple tests, yet Phi-3 handled them pretty nicely: [screenshots]

mirek190 commented 4 months ago

Tested with llama.cpp, fp16 and Q8 versions.

Do you also have this problem: it keeps generating tokens until I manually stop it?

I had to add -r "----" -r "---" -r "<|end|>>" -r "### Answer:"

criminact commented 4 months ago

@mirek190 You mean the model doesn't stop generating? Yeah, I faced it too. PR #6851 handles it.

mirek190 commented 4 months ago


Not too bad... not at Llama 3 8B level, but still.

Phi-3:

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|end|>\n<|assistant|>This scenario appears to be a hypothetical situation rather than an actual event. However, it is intended to evoke thought about ethical dilemmas in the medical profession.

In reality, such situations are highly unlikely due to professional and personal boundaries that doctors maintain, even if they happen to know the patient personally through other connections (e.g., being friends or acquaintances).

To address this hypothetical situation responsibly:

1. The surgeon would not be able to perform surgery on their biological son based solely on familial relations due to ethical standards that prevent personal relationships from interfering with professional duties and patient care.

2. If the boy's father, who is no longer alive in this scenario, had appointed a close family member or friend as his medical proxy for such circumstances, then it would be possible for them (a relative) to make decisions on behalf of their son.

3. In such cases where an immediate decision must be made about the child's treatment but no designated proxy is available, hospital ethics committees or legal authorities may need to step in to determine the best course of action for the patient's well-being.

It is important to remember that real-life situations involving surgeons and their family members are handled with utmost care for the welfare of both parties involved, always prioritizing medical ethics and professional obligations.

Llama 3:

A father and son are in a car accident where the father is killed. The ambulance brings the son to the hospital. He needs immediate surgery. In the operating room, the surgeon looks at the boy and says "I can't operate on him, he's my son!"
<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is a classic lateral thinking puzzle!

The answer lies in the fact that the surgeon is not the biological father of the boy. The correct interpretation is that the surgeon is an adoptive parent or a stepfather.

If we consider other possible scenarios, it's also possible that the surgeon is a woman who has adopted the son or is his stepmother. In any case, the key point is that the surgeon is not biologically related to the boy as his father.

Llama 3 is on a totally different level compared to Phi-3...

tristandruyen commented 4 months ago

Doing my part by adding the chat template :) https://github.com/ggerganov/llama.cpp/pull/6857

dspasyuk commented 4 months ago

The model seems to be working fine on my end, it just generates text endlessly.

../llama.cpp/main --model /home/denis/Downloads/phi-3-mini-4k-instruct.Q8_0.gguf --n-gpu-layers 35 -ins --interactive --keep -1 --n-predict -1 --simple-io -b 2048 --ctx_size 0 --temp 0.1 --top_k 10 -mg 0 --multiline-input --repeat_penalty 1.12 -t 4 -r "/n>" -p "<|system|>Your name is Alice. You are kind, honest, logical, precise, good at writing and mathematics assistant. <|end|>"

criminact commented 4 months ago

Closing this since PR: https://github.com/ggerganov/llama.cpp/pull/6857 was merged into master with support for Phi-3 4K context length.

s-kostyaev commented 4 months ago

What about the 128k context length variant?

lukestanley commented 4 months ago

Support for 128K context length seems pretty important to me for "Phi-3" support to be considered "done", right? @criminact

criminact commented 4 months ago

Status: Phi-3 4K models are supported in master after https://github.com/ggerganov/llama.cpp/pull/6857 merge

Phi-3 128K models aren't supported yet (as of 24th Apr 2024)

phalexo commented 4 months ago

Template for llama.cpp:

main.exe --model models/new3/Phi-3-mini-4k-instruct-fp16.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 0 --interactive -ins -ngl 99 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

Are templates different for 4K vs. 128K?

jtomek commented 4 months ago

Hi guys, what should I do about this error: unknown model architecture: 'phi3'?

I fine-tuned my own Phi-3 and converted it to GGUF with this command: python llama.cpp/convert-hf-to-gguf.py midesk-private --outfile midesk-private-gguf-4k-v0.0.gguf

I get the error when I run

from llama_cpp import Llama
llm = Llama(
      model_path="./midesk-private-gguf-4k-v0.0.gguf"
)

I would be very thankful for any help or push in the right direction.

phalexo commented 4 months ago

With a reduced context size of 60000 I can load a 128K model. The prompting is still messed up, though.

./main --model /opt/data/pjh64/Phi-3-mini-128K-Instruct.gguf/phi-3-mini-128K-Instruct_q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 60000 --interactive -ins -ngl 33 --simple-io --in-prefix "<|user|>\n" --in-suffix "<|end|>\n<|assistant|>" -p "<|system|>You are a helpful assistant.<|end|>\n "

main: interactive mode on.
Reverse prompt: '### Instruction:

'
Input prefix: '<|user|>\n'
Input suffix: '<|end|>\n<|assistant|>'
sampling: repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 60000, n_batch = 2048, n_predict = -1, n_keep = 12

== Running in interactive mode. ==

<|system|>You are a helpful assistant.<|end|>\n

<|user|>\nHello. Tell me a story. <|end|>\n<|assistant|>Once upon a time, in the serene land of Veridia, there was an enchanted forest known as Luminae Woods. The woodland sparkled under moonlight with each tree shimmering like living stars.

In this mystical place lived Elara, a beautiful young maiden blessed with iridescent hair and eyes that mirrored the depth of the cosmos.

Elara had one unique trait - she could converse with nature itself. She conversed with trees whispering secrets in rustling leaves, birds humming songs only they could understand.

One fateful day, a dark cloud loomed over Veridia. A malicious sorcerer named Malachar desired to steal the magical essence of Veridia for his own nefarious purposes.

Upon hearing this news, Elara decided she wouldn't let her homeland fall into despair. With bravery coursing through her veins and courage in her heart, she embarked on a perilous quest to stop the wicked sorcerer.

With each passing day, Elara encountered numerous trials that tested her courage, wisdom, and resilience. She journeyed across treacherous terrains, braved wild beasts and outsmarted magical illusions crafted by Malachar himself.

As Elara ventured deeper into the darkness of Maleficent's lair, she came face-to-face with the sorcerer. A battle of magic unfolded - a clash between good and evil, light against dark.

Despite feeling overwhelmed by Malachar's mightier spells, Elara held on to her heart's purity, believing in herself and her mission for Veridia's peace.

In the end, it was Elara who prevailed. With a final surge of magic she wielded from within, she vanquished Malachar, breaking his dark curse over Veridia.

Afterwards, with peace restored to Veridia and its inhabitants living in harmony once more, Elara became the beloved guardian of Luminae Woods, continuing her duty as the voice of nature itself.

Thus ends a tale about courage, goodness, and the power that resides within us all. It's a timeless story of how one person can make an immense difference in preserving peace and harmony.

And so, dear listener, let this legend inspire you to face your own battles with bravery and integrity - for it is these virtues which truly define the worthiness of any individual or character.<|end|>

<|user|>\n

ryao commented 4 months ago

@phalexo You should use -e as an argument too.

nullnuller commented 4 months ago

Status: Phi-3 4K models are supported in master after #6857 merge

Phi-3 128K models aren't supported yet (as of 24th Apr 2024)

Hi, any update on the 128k support?

smartjx commented 4 months ago

Any update on 128K?

mirek190 commented 4 months ago

Any update on 128K? :)

ggerganov commented 4 months ago

For 128K, you can help by summarizing and providing references for what needs to be implemented.

maxrubin629 commented 4 months ago

For 128K, you can help by summarizing and providing references for what needs to be implemented.

I believe all that's needed is LongRoPE; that's the only distinguishing factor between the 4k and 128k context variants. Phi-3 technical report: https://arxiv.org/pdf/2404.14219

"We also introduce a long context version via LongRope [DZZ+24] that extends the context length to 128K, called phi-3-mini-128K."

LongRoPE paper [DZZ+24]: https://arxiv.org/pdf/2402.13753

LorenzoBiassio commented 4 months ago

I found a model of Phi-3-128k that works: https://huggingface.co/MoMonir/Phi-3-mini-128k-instruct-GGUF

The downside is that it only works with a maximum of 64k tokens set in the Model Initialization; if set higher it just fails to load. Here is the error:

{
  "title": "Failed to load model",
  "cause": "",
  "errorData": {
    "n_ctx": 131072,
    "n_batch": 512,
    "n_gpu_layers": 33
  },
  "data": {
    "memory": {
      "ram_capacity": "13.81 GB",
      "ram_unused": "9.30 GB"
    },
    "gpu": {
      "type": "AmdROCm",
      "vram_recommended_capacity": "6.99 GB",
      "vram_unused": "6.85 GB"
    },
    "os": {
      "platform": "win32",
      "version": "10.0.22631",
      "supports_avx2": true
    },
    "app": {
      "version": "0.2.20",
      "downloadsDir": "C:\\Users\\lorenzo\\.cache\\lm-studio\\models"
    },
    "model": {}
  }
}

It seems to work because the arch is set to llama:

{
  "name": "phi3",
  "arch": "llama",
  "quant": "Q8_0",
  "context_length": 131072,
  "embedding_length": 3072,
  "num_layers": 32,
  "rope": {
    "freq_base": 10000,
    "dimension_count": 96
  },
  "head_count": 32,
  "head_count_kv": 32,
  "parameters": "7B"
}

I've tried 4 different models, and all of them had arch set to phi3, and none worked. I don't have a real solution, but this is working for me.

flatsiedatsie commented 4 months ago

According to the paper, code will become available here: https://github.com/microsoft/LongRoPE

It's currently giving a 404, though.

phalexo commented 4 months ago

We had a discussion about this on Hugging Face, in a few places. I am about to try changing the architecture name in one of the conversion scripts to see if that works. There is an even weirder problem that I see with ollama: it loads the 3.2GB model multiple times into all of my GPUs, using around 49GB of VRAM in total.


dmsweetser commented 4 months ago

When I look at the Microsoft-published 128k model, specifically the source for modeling_phi3.py (https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/resolve/main/modeling_phi3.py), I see this implementation of RoPE (in particular Phi3SuScaledRotaryEmbedding). Does this help at all?

    def _init_rope(self):
        if self.rope_scaling is None:
            self.rotary_emb = Phi3RotaryEmbedding(
                self.head_dim,
                max_position_embeddings=self.max_position_embeddings,
                base=self.rope_theta,
            )
        else:
            scaling_type = self.config.rope_scaling["type"]
            short_factor = self.config.rope_scaling["short_factor"]
            long_factor = self.config.rope_scaling["long_factor"]

            if scaling_type == "su":
                self.rotary_emb = Phi3SuScaledRotaryEmbedding(
                    self.head_dim,
                    short_factor,
                    long_factor,
                    max_position_embeddings=self.max_position_embeddings,
                    original_max_position_embeddings=self.original_max_position_embeddings,
                    base=self.rope_theta,
                )
            elif scaling_type == "yarn":
                self.rotary_emb = Phi3YarnScaledRotaryEmbedding(
                    self.head_dim,
                    short_factor,
                    long_factor,
                    max_position_embeddings=self.max_position_embeddings,
                    original_max_position_embeddings=self.original_max_position_embeddings,
                    base=self.rope_theta,
                )
            else:
                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
......
# Copied from transformers.models.gemma.modeling_gemma.GemmaRotaryEmbedding with gemma->phi3, Gemma->Phi3
class Phi3RotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        self.register_buffer("inv_freq", None, persistent=False)

    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if self.inv_freq is None:
            self.inv_freq = 1.0 / (
                self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
            )
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()
        # Force float32 since bfloat16 loses precision on long contexts
        # See https://github.com/huggingface/transformers/pull/29285
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos()
            sin = emb.sin()
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

class Phi3SuScaledRotaryEmbedding(Phi3RotaryEmbedding):
    def __init__(
        self,
        dim,
        short_factor,
        long_factor,
        original_max_position_embeddings=2048,
        max_position_embeddings=2048,
        base=10000,
        device=None,
    ):
        super().__init__(dim, max_position_embeddings, base, device)

        self.short_factor = short_factor
        self.long_factor = long_factor
        self.original_max_position_embeddings = original_max_position_embeddings

    def _calc_scaling_factor(self, scale):
        if scale <= 1.0:
            return 1.0
        return math.sqrt(1 + math.log(scale) / math.log(self.original_max_position_embeddings))

    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.original_max_position_embeddings:
            ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
        else:
            ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)

        self.inv_freq = 1.0 / (
            ext_factors
            * self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
        )
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()

        # Force float32 since bfloat16 loses precision on long contexts
        # See https://github.com/huggingface/transformers/pull/29285
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            scaling_factor = self._calc_scaling_factor(
                self.max_position_embeddings / self.original_max_position_embeddings
            )
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos() * scaling_factor
            sin = emb.sin() * scaling_factor
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

class Phi3YarnScaledRotaryEmbedding(Phi3RotaryEmbedding):
    def __init__(
        self,
        dim,
        short_factor,
        long_factor,
        original_max_position_embeddings=2048,
        max_position_embeddings=2048,
        base=10000,
        device=None,
    ):
        super().__init__(dim, max_position_embeddings, base, device)

        self.short_factor = short_factor
        self.long_factor = long_factor
        self.original_max_position_embeddings = original_max_position_embeddings

    def _calc_scaling_factor(self, scale):
        if scale <= 1.0:
            return 1.0
        return 0.1 * math.log(scale) + 1.0

    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
        seq_len = torch.max(position_ids) + 1
        if seq_len > self.original_max_position_embeddings:
            ext_factors = torch.tensor(self.long_factor, dtype=torch.float32, device=x.device)
        else:
            ext_factors = torch.tensor(self.short_factor, dtype=torch.float32, device=x.device)

        self.inv_freq = 1.0 / (
            ext_factors
            * self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
        )
        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
        position_ids_expanded = position_ids[:, None, :].float()

        # Force float32 since bfloat16 loses precision on long contexts
        # See https://github.com/huggingface/transformers/pull/29285
        device_type = x.device.type
        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
        with torch.autocast(device_type=device_type, enabled=False):
            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
            scaling_factor = self._calc_scaling_factor(
                self.max_position_embeddings / self.original_max_position_embeddings
            )
            emb = torch.cat((freqs, freqs), dim=-1)
            cos = emb.cos() * scaling_factor
            sin = emb.sin() * scaling_factor
        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
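
Condensing the snippet above into formulas (a summary of the quoted code, not new behavior): with s = max_position_embeddings / original_max_position_embeddings, θ = rope_theta, d the head dimension, and f_i the per-dimension entry of long_factor or short_factor, the two scaled variants only differ in the amplitude α applied to the cos/sin caches:

$$
\mathrm{inv\_freq}_i = \frac{1}{f_i\,\theta^{\,2i/d}}, \qquad
\alpha_{\text{su}} = \sqrt{1 + \frac{\ln s}{\ln(\text{original\_max\_position\_embeddings})}}, \qquad
\alpha_{\text{yarn}} = 0.1\,\ln s + 1
$$

where f_i comes from long_factor when the current sequence length exceeds original_max_position_embeddings and from short_factor otherwise, and α = 1 when s ≤ 1.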
dmsweetser commented 4 months ago

Here are the relevant bits from the config.json:

  "max_position_embeddings": 131072,
  "original_max_position_embeddings": 4096,
  "rope_scaling": {
    "long_factor": [
      1.0299999713897705,
      1.0499999523162842,
      1.0499999523162842,
      1.0799999237060547,
      1.2299998998641968,
      1.2299998998641968,
      1.2999999523162842,
      1.4499999284744263,
      1.5999999046325684,
      1.6499998569488525,
      1.8999998569488525,
      2.859999895095825,
      3.68999981880188,
      5.419999599456787,
      5.489999771118164,
      5.489999771118164,
      9.09000015258789,
      11.579999923706055,
      15.65999984741211,
      15.769999504089355,
      15.789999961853027,
      18.360000610351562,
      21.989999771118164,
      23.079999923706055,
      30.009998321533203,
      32.35000228881836,
      32.590003967285156,
      35.56000518798828,
      39.95000457763672,
      53.840003967285156,
      56.20000457763672,
      57.95000457763672,
      59.29000473022461,
      59.77000427246094,
      59.920005798339844,
      61.190006256103516,
      61.96000671386719,
      62.50000762939453,
      63.3700065612793,
      63.48000717163086,
      63.48000717163086,
      63.66000747680664,
      63.850006103515625,
      64.08000946044922,
      64.760009765625,
      64.80001068115234,
      64.81001281738281,
      64.81001281738281
    ],
    "short_factor": [
      1.05,
      1.05,
      1.05,
      1.1,
      1.1,
      1.1500000000000001,
      1.2000000000000002,
      1.2500000000000002,
      1.3000000000000003,
      1.3500000000000003,
      1.5000000000000004,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.0500000000000007,
      2.0500000000000007,
      2.0500000000000007,
      2.1000000000000005,
      2.1000000000000005,
      2.1000000000000005,
      2.1500000000000004,
      2.1500000000000004,
      2.3499999999999996,
      2.549999999999999,
      2.5999999999999988,
      2.5999999999999988,
      2.7499999999999982,
      2.849999999999998,
      2.849999999999998,
      2.9499999999999975
    ],
    "type": "su"
  },
  "rope_theta": 10000.0
dmsweetser commented 4 months ago

For anyone else who wants to spam-search Microsoft's GitHub in hopes this suddenly, magically appears: https://github.com/search?q=org%3Amicrosoft+LongRoPE&type=code

dmsweetser commented 4 months ago

I found this issue, where mistral.rs has implemented LongRoPE in Rust:

https://github.com/huggingface/candle/issues/2123

Here's the source:

https://github.com/EricLBuehler/mistral.rs/blob/6334b30fdf6447fa787dcbedb032fb825c22ae1f/mistralrs-core/src/models/layers.rs#L84

I had ChatGPT (aka ClosedAI) convert this for me because I don't know Rust OR C++:

#include <vector>
#include <cmath>
#include <tensor.h> // Assuming the existence of a "Tensor" class for tensor operations

enum ScaledRopeType { Su, Yarn }; // Define ScaledRopeType enum

struct ScaledRopeParams {
    std::vector<float> short_factor;
    std::vector<float> long_factor;
    ScaledRopeType scaling_type;
};

struct PhiRotaryEmbedding {
    Tensor short_cos;
    Tensor short_sin;
    std::optional<Tensor> long_cos;
    std::optional<Tensor> long_sin;
    int original_max_position_embeddings;

    static PhiRotaryEmbedding new(DType dtype, const Config& cfg, const Device& dev) {
        std::optional<ScaledRopeParams> scaled_params;
        // Initialize scaled_params if cfg.rope_scaling has value
        if (cfg.rope_scaling) {
            scaled_params = ScaledRopeParams{
                cfg.rope_scaling->short_factor,
                cfg.rope_scaling->long_factor,
                cfg.rope_scaling->scaling_type
            };
        }
        int max_seq_len = cfg.max_position_embeddings;
        int dim = cfg.head_dim();

        if (scaled_params) {
            // Calculate scale
            double scale = static_cast<double>(cfg.max_position_embeddings) / static_cast<double>(cfg.original_max_position_embeddings);
            double scaling_factor = (scale <= 1.0) ? 1.0 : 
                (scaled_params->scaling_type == ScaledRopeType::Su) ?
                    (1.0 + std::log(scale) / std::log(static_cast<double>(cfg.original_max_position_embeddings))) : 
                    (0.1 * std::log(scale) + 1.0);

            // Calculate inv freqs for short, long
            std::vector<float> inv_freq_long(dim / 2);
            for (int k = 0, i = 0; k < dim; k += 2, ++i) {
                inv_freq_long[i] = 1.0f / (scaled_params->long_factor[i] * std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim)));
            }
            std::vector<float> inv_freq_short(dim / 2);
            for (int k = 0, i = 0; k < dim; k += 2, ++i) {
                inv_freq_short[i] = 1.0f / (scaled_params->short_factor[i] * std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim)));
            }
            int inv_freq_len = inv_freq_long.size();

            Tensor t = Tensor::arange(0u32, max_seq_len, dev).to_dtype(dtype).reshape({ max_seq_len, 1 });

            // Calculate sin,cos for long
            Tensor inv_freq_long_tensor = Tensor::from_vec(inv_freq_long, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor freqs_long = t.matmul(inv_freq_long_tensor);
            Tensor long_sin = freqs_long.sin().mul(scaling_factor);
            Tensor long_cos = freqs_long.cos().mul(scaling_factor);

            // Calculate sin,cos for short
            Tensor inv_freq_short_tensor = Tensor::from_vec(inv_freq_short, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor freqs_short = t.matmul(inv_freq_short_tensor);
            Tensor short_sin = freqs_short.sin().mul(scaling_factor);
            Tensor short_cos = freqs_short.cos().mul(scaling_factor);

            return PhiRotaryEmbedding {
                short_cos,
                short_sin,
                long_cos,
                long_sin,
                cfg.original_max_position_embeddings
            };
        } else {
            std::vector<float> inv_freq(dim / 2);
            for (int i = 0; i < dim; i += 2) {
                inv_freq[i / 2] = 1.0f / std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim));
            }
            int inv_freq_len = inv_freq.size();
            Tensor inv_freq_tensor = Tensor::from_vec(inv_freq, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor t = Tensor::arange(0u32, max_seq_len, dev).to_dtype(dtype).reshape({ max_seq_len, 1 });
            Tensor freqs = t.matmul(inv_freq_tensor);
            Tensor sin = freqs.sin();
            Tensor cos = freqs.cos();
            return PhiRotaryEmbedding {
                cos,
                sin,
                std::nullopt,
                std::nullopt,
                cfg.original_max_position_embeddings
            };
        }
    }

    std::pair<const Tensor&, const Tensor&> get_long_or_short_sin_cos(const std::vector<size_t>& seqlen_offsets) const {
        if (!long_cos.has_value()) {
            return { short_sin, short_cos };
        }
        size_t seq_len = *std::max_element(seqlen_offsets.begin(), seqlen_offsets.end()) + 1;
        if (seq_len > original_max_position_embeddings) {
            return { *long_sin, *long_cos };
        } else {
            return { short_sin, short_cos };
        }
    }

    std::pair<Tensor, Tensor> forward(const Tensor& q, const Tensor& k, const std::vector<size_t>& seqlen_offsets) const {
        auto [_b_sz, _h, seq_len, _n_embd] = q.dims4();
        std::vector<Tensor> q_embeds;
        std::vector<Tensor> k_embeds;
        auto [sin, cos] = get_long_or_short_sin_cos(seqlen_offsets);
        for (auto offset : seqlen_offsets) {
            Tensor cos_chunk = cos.narrow(0, offset, seq_len);
            Tensor sin_chunk = sin.narrow(0, offset, seq_len);
            Tensor q_embed = candle_nn::rotary_emb::rope(q.contiguous(), cos_chunk, sin_chunk);
            Tensor k_embed = candle_nn::rotary_emb::rope(k.contiguous(), cos_chunk, sin_chunk);
            q_embeds.push_back(q_embed);
            k_embeds.push_back(k_embed);
        }
        return { Tensor::cat(q_embeds, 0), Tensor::cat(k_embeds, 0) };
    }
};

Looks great to me.

maxrubin629 commented 4 months ago

For anyone else who wants to spam-search Microsoft's GitHub in hopes this suddenly, magically appears: https://github.com/search?q=org%3Amicrosoft+LongRoPE&type=code

Looks like there have been a couple of lines added that reference PhiLongRoPE!

maxrubin629 commented 4 months ago

I found this issue, where mistral.rs has implemented LongRoPE in Rust: huggingface/candle#2123

Here's the source: https://github.com/EricLBuehler/mistral.rs/blob/6334b30fdf6447fa787dcbedb032fb825c22ae1f/mistralrs-core/src/models/layers.rs#L84

I had ChatGPT (aka ClosedAI) convert this for me because I don't know Rust OR C++: [...]

I passed the original Rust and ClosedAI — I mean ChatGPT's implementation to Claude 3 Opus. The rest of this comment is written by Claude:

#include <vector>
#include <cmath>
#include <optional>
#include <algorithm>
#include <utility>
#include "tensor.h" // Assuming the existence of a "Tensor" class for tensor operations

enum class ScaledRopeType { Su, Yarn }; // Define ScaledRopeType enum

struct ScaledRopeParams {
    std::vector<float> short_factor;
    std::vector<float> long_factor;
    ScaledRopeType scaling_type;
};

class PhiRotaryEmbedding {
public:
    static PhiRotaryEmbedding new(DType dtype, const Config& cfg, const Device& dev) {
        std::optional<ScaledRopeParams> scaled_params;
        // Initialize scaled_params if cfg.rope_scaling has value
        if (cfg.rope_scaling) {
            scaled_params = ScaledRopeParams{
                cfg.rope_scaling->short_factor,
                cfg.rope_scaling->long_factor,
                cfg.rope_scaling->scaling_type
            };
        }
        int max_seq_len = cfg.max_position_embeddings;
        int dim = cfg.head_dim();

        if (scaled_params) {
            // Calculate scale
            double scale = static_cast<double>(cfg.max_position_embeddings) / static_cast<double>(cfg.original_max_position_embeddings);
            double scaling_factor = (scale <= 1.0) ? 1.0 : 
                (scaled_params->scaling_type == ScaledRopeType::Su) ?
                    std::sqrt(1.0 + std::log(scale) / std::log(static_cast<double>(cfg.original_max_position_embeddings))) : 
                    (0.1 * std::log(scale) + 1.0);

            // Calculate inv freqs for short, long
            std::vector<float> inv_freq_long;
            inv_freq_long.reserve(dim / 2);
            for (int i = 0; i < dim; i += 2) {
                inv_freq_long.push_back(1.0f / (scaled_params->long_factor[i / 2] * std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim))));
            }
            std::vector<float> inv_freq_short;
            inv_freq_short.reserve(dim / 2);
            for (int i = 0; i < dim; i += 2) {
                inv_freq_short.push_back(1.0f / (scaled_params->short_factor[i / 2] * std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim))));
            }
            int inv_freq_len = inv_freq_long.size();

            Tensor t = Tensor::arange(0u32, max_seq_len, dev).to_dtype(dtype).reshape({ max_seq_len, 1 });

            // Calculate sin,cos for long
            Tensor inv_freq_long_tensor = Tensor::from_vec(inv_freq_long, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor freqs_long = t.matmul(inv_freq_long_tensor);
            Tensor long_sin = freqs_long.sin().mul(scaling_factor);
            Tensor long_cos = freqs_long.cos().mul(scaling_factor);

            // Calculate sin,cos for short
            Tensor inv_freq_short_tensor = Tensor::from_vec(inv_freq_short, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor freqs_short = t.matmul(inv_freq_short_tensor);
            Tensor short_sin = freqs_short.sin().mul(scaling_factor);
            Tensor short_cos = freqs_short.cos().mul(scaling_factor);

            return PhiRotaryEmbedding(std::move(short_cos), std::move(short_sin), std::move(long_cos), std::move(long_sin), cfg.original_max_position_embeddings);
        } else {
            std::vector<float> inv_freq;
            inv_freq.reserve(dim / 2);
            for (int i = 0; i < dim; i += 2) {
                inv_freq.push_back(1.0f / std::pow(cfg.rope_theta, static_cast<float>(i) / static_cast<float>(dim)));
            }
            int inv_freq_len = inv_freq.size();
            Tensor inv_freq_tensor = Tensor::from_vec(inv_freq, { 1, inv_freq_len }, dev).to_dtype(dtype);
            Tensor t = Tensor::arange(0u32, max_seq_len, dev).to_dtype(dtype).reshape({ max_seq_len, 1 });
            Tensor freqs = t.matmul(inv_freq_tensor);
            Tensor sin = freqs.sin();
            Tensor cos = freqs.cos();
            return PhiRotaryEmbedding(std::move(cos), std::move(sin), std::nullopt, std::nullopt, cfg.original_max_position_embeddings);
        }
    }

    std::pair<const Tensor&, const Tensor&> get_long_or_short_sin_cos(const std::vector<size_t>& seqlen_offsets) const {
        if (!long_cos.has_value()) {
            return { short_sin, short_cos };
        }
        size_t seq_len = *std::max_element(seqlen_offsets.begin(), seqlen_offsets.end()) + 1;
        if (seq_len > original_max_position_embeddings) {
            return { *long_sin, *long_cos };
        } else {
            return { short_sin, short_cos };
        }
    }

    std::pair<Tensor, Tensor> forward(const Tensor& q, const Tensor& k, const std::vector<size_t>& seqlen_offsets) const {
        auto [_b_sz, _h, seq_len, _n_embd] = q.dims4();
        std::vector<Tensor> q_embeds;
        std::vector<Tensor> k_embeds;
        auto [sin, cos] = get_long_or_short_sin_cos(seqlen_offsets);
        for (auto offset : seqlen_offsets) {
            Tensor cos_chunk = cos.narrow(0, offset, seq_len);
            Tensor sin_chunk = sin.narrow(0, offset, seq_len);
            Tensor q_embed = candle_nn::rotary_emb::rope(q.contiguous(), cos_chunk, sin_chunk);
            Tensor k_embed = candle_nn::rotary_emb::rope(k.contiguous(), cos_chunk, sin_chunk);
            q_embeds.push_back(std::move(q_embed));
            k_embeds.push_back(std::move(k_embed));
        }
        return { Tensor::cat(q_embeds, 0), Tensor::cat(k_embeds, 0) };
    }

private:
    Tensor short_cos;
    Tensor short_sin;
    std::optional<Tensor> long_cos;
    std::optional<Tensor> long_sin;
    int original_max_position_embeddings;

    PhiRotaryEmbedding(Tensor short_cos, Tensor short_sin, std::optional<Tensor> long_cos, std::optional<Tensor> long_sin, int original_max_position_embeddings)
        : short_cos(std::move(short_cos)), short_sin(std::move(short_sin)), long_cos(std::move(long_cos)), long_sin(std::move(long_sin)),
          original_max_position_embeddings(original_max_position_embeddings) {}
};

Here are the main changes and improvements made:

- Used enum class instead of a plain enum for better type safety.
- Moved the PhiRotaryEmbedding struct to a class with private members and a private constructor. This encapsulates the internal state and provides better control over object creation.
- Used std::optional consistently.
- Used std::sqrt instead of manually calculating the square root.
- Used std::vector::reserve to preallocate memory for vectors, improving performance.
- Used std::move to efficiently move temporary objects when constructing and returning PhiRotaryEmbedding objects.
- Used structured bindings to unpack tuples for improved readability.

These changes make the code more idiomatic, efficient, and easier to maintain in C++. Please note that this code assumes the existence of a 'Tensor' class and related utilities (candle_nn::rotary_emb::rope, Config, Device, DType). Make sure to include the necessary headers and dependencies for these components.

Hoping this helps in some way!!

EricLBuehler commented 4 months ago

@dmsweetser, @maxrubin629 I noticed that you referenced our Phi RoPE code from mistral.rs. We recently fixed a bug in EricLBuehler/mistral.rs#251 which fixes the switching between short and long RoPE scalings.

The bug was that the sequence offsets (seqlen_offsets) were used to determine the sequence length instead of the correct way, which is with position ids. Specifically, the error occurred when the prompt length was greater than original_max_position_embeddings: in that case, the short scalings were used instead of the long scalings that should have been used.

The fix is to not use the sequence offsets. Instead, we directly pass the position ids. Here is an example of our fix.
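
The selection rule Eric describes boils down to something like this (a rough Python sketch of the logic in the HF code quoted earlier; names are illustrative, not mistral.rs' or llama.cpp's actual API):

# Pick the factor set from the position ids, not from per-sequence offsets.
def pick_rope_factors(position_ids, original_max_position_embeddings,
                      short_factor, long_factor):
    seq_len = max(position_ids) + 1
    if seq_len > original_max_position_embeddings:
        return long_factor
    return short_factor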

Jacck commented 4 months ago

I am getting warning: 'Phi3ForCausalLM' is not supported for text-generation when running Phi-3-mini-128k-instruct locally. Is Phi3ForCausalLM going to be supported?

foldl commented 4 months ago

FYI: implementation of LongRoPE:

https://github.com/foldl/chatllm.cpp/blob/572448ff071278d783d39d7463b471db90ce73c9/custom_ops.cpp#L586

My tests show that we can't dynamically switch to long_factor from short_factor.

mzwing commented 4 months ago

I am getting warning: 'Phi3ForCausalLM' is not supported for text-generation when running Phi-3-mini-128k-instruct locally. Is Phi3ForCausalLM going to be supported?

Sounds quite strange.

See here: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py#L2080

It is already supported. Maybe it's just because you haven't updated your llama.cpp :)

flatsiedatsie commented 4 months ago

Microsoft says they'll look into whether they can help get the 128K version implemented in Llama.cpp: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/48

DataBassGit commented 4 months ago

They closed that issue. I doubt they will follow up.

EricLBuehler commented 4 months ago

I have a working implementation for LongRoPE here supporting Phi-3 128k. It is written in Rust, but the key modifications are:

1) Calculating the correct scaling factor (su vs yarn): https://github.com/EricLBuehler/mistral.rs/blob/097ac3d142fb2123caba523ffffa9c0da719acc8/mistralrs-core/src/layers.rs#L109-L118
2) Multiplying the inverse freq by the long/short scaling factor: https://github.com/EricLBuehler/mistral.rs/blob/097ac3d142fb2123caba523ffffa9c0da719acc8/mistralrs-core/src/layers.rs#L125-L126
3) Applying the scaling factor to the computed cos/sin caches: https://github.com/EricLBuehler/mistral.rs/blob/097ac3d142fb2123caba523ffffa9c0da719acc8/mistralrs-core/src/layers.rs#L146
4) Switching between the long and short cos/sin caches at runtime depending on the sequence length: https://github.com/EricLBuehler/mistral.rs/blob/097ac3d142fb2123caba523ffffa9c0da719acc8/mistralrs-core/src/layers.rs#L190-L191

Perhaps this would be useful?

foldl commented 4 months ago

@EricLBuehler Have you tested the behavior of switching from short_factor to long_factor? Does it work or not? My tests show that it doesn't. See also this discussion on huggingface.

By switching, I mean that the 4k-token boundary is crossed in a multi-turn conversation.

I have used max_length to select which factor is used, where max_length can be modified from the command line: https://github.com/foldl/chatllm.cpp/blob/572448ff071278d783d39d7463b471db90ce73c9/layers.cpp#L700C5-L705C6
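
In other words (my paraphrase as a sketch, not foldl's actual code), the factor set is chosen once at load time from the user-requested context length, rather than being switched per call:

# Choose long_factor vs short_factor once, based on the configured max context length.
def pick_rope_factors_at_load(max_length, original_max_position_embeddings,
                              short_factor, long_factor):
    if max_length > original_max_position_embeddings:
        return long_factor
    return short_factor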

EricLBuehler commented 4 months ago

@foldl, our implementation (which you can try out here) of the switching does work, and we tested it with our Llama index integration.

Before we correctly implemented the switching logic, we were getting garbage output once the model went over 4096 tokens. Like they mention in the discussion you linked, there can be some issues around the 4k boundary, but the model recovers, like the HF impl. We switch when the sequence length crosses 4096 tokens.

With your implementation, does the model begin generating gibberish when it crosses 4096?

foldl commented 4 months ago

@EricLBuehler IMO, issues around the 4k boundary, even if it can recover, are unacceptable.

EricLBuehler commented 4 months ago

@foldl, yes, it is unacceptable. However, we also see this behavior with the HF impl so perhaps it is a property of the model?

foldl commented 4 months ago

It can hardly be called a feature 😃. Let's see how they fix it.

Meanwhile, I would not duplicate the buggy behavior of their HF code, and would prefer using long_factor from the beginning if more than 4k tokens are required.