Fill-in-the-middle support for CodeQwen

QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by the Qwen team, Alibaba Cloud.


Fill-in-the-middle support for CodeQwen #13

Closed sheepymeh closed 7 months ago

sheepymeh commented 7 months ago

Hi, I'd like to ask whether CodeQwen has a token for fill-in-the-middle generation.

huybery commented 7 months ago

Hi, you can refer to this example: https://github.com/QwenLM/CodeQwen1.5/blob/main/examples/CodeQwen1.5-base-fim.py

sheepymeh commented 7 months ago

Thank you very much! I was looking at the special tokens and didn't spot them because they are marked as special: false. Is this intentional? I expected the <fim_*> tokens to be labeled as special.

huybery commented 7 months ago

Don't worry about it. Each <fim_*> token will be treated as a separate token by the model, and special: false is a configurable parameter that can be ignored.

sheepymeh commented 7 months ago

I see, thank you!

mechigonft commented 7 months ago

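I'm a beginner with large models. In the fill-in-the-middle scenario, what do the special tokens fim_prefix, fim_suffix, and fim_middle do? Which one tells the model to generate code? What other special tokens are there besides these, and where can I look them up?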

sheepymeh commented 7 months ago

By default, the model generates the code that follows the input. In FIM mode, it can generate the code between two halves, for example:

    <fim_prefix>def quicksort(arr):
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]
        <fim_suffix>
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quicksort(left) + middle + quicksort(right)<fim_middle>

<fim_prefix> marks the first half and <fim_suffix> marks the second half. <fim_middle> at the end of the input then prompts the model to generate the code in between.
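To make the layout concrete, here is a minimal sketch of assembling such a prompt as a plain string and splicing a completion back in. The helper names `build_fim_prompt` and `splice_completion` are made up for illustration, and the `middle` string is written by hand to stand in for what a model might return; no model is called here.

```python
# Token strings as they appear in this thread.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: prefix first, then suffix, then the middle marker."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Hypothetical helper: put the generated middle back between the halves."""
    return prefix + middle + suffix


prefix = (
    "def quicksort(arr):\n"
    "    if len(arr) <= 1:\n"
    "        return arr\n"
    "    pivot = arr[len(arr) // 2]\n"
    "    "
)
suffix = (
    "\n"
    "    middle = [x for x in arr if x == pivot]\n"
    "    right = [x for x in arr if x > pivot]\n"
    "    return quicksort(left) + middle + quicksort(right)"
)

prompt = build_fim_prompt(prefix, suffix)

# Hand-written stand-in for the text a model would generate after <fim_middle>.
middle = "left = [x for x in arr if x < pivot]"
full_source = splice_completion(prefix, middle, suffix)
```

The prompt string is what you would tokenize and pass to the model; whatever it generates after <fim_middle> is the missing piece that goes between the two halves.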

You can find the other special tokens in the tokenizer.json file.
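As a rough sketch of how to inspect that file, the snippet below lists the entries of the `added_tokens` array together with their `special` flag, assuming the file follows the usual Hugging Face tokenizer JSON layout. The inline `sample` data, including its ids, is invented for illustration and is not copied from the CodeQwen tokenizer; with a real file you would load it with `json.load` instead.

```python
import json


def list_added_tokens(tokenizer_data: dict) -> list[tuple[str, bool]]:
    """Return (content, special) pairs from the "added_tokens" array, if present."""
    return [
        (entry["content"], bool(entry.get("special", False)))
        for entry in tokenizer_data.get("added_tokens", [])
    ]


# Illustrative stand-in for a parsed tokenizer JSON file (ids are made up).
sample = json.loads(
    """
    {
      "added_tokens": [
        {"id": 0, "content": "<fim_prefix>", "special": false},
        {"id": 1, "content": "<fim_middle>", "special": false},
        {"id": 2, "content": "<fim_suffix>", "special": false}
      ]
    }
    """
)

tokens = list_added_tokens(sample)
fim_tokens = [content for content, _ in tokens if content.startswith("<fim_")]
```

Filtering on the token text rather than on the `special` flag is deliberate: as noted earlier in the thread, the <fim_*> tokens are marked special: false, so a filter on that flag alone would miss them.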