QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by Qwen team, Alibaba Cloud.
【Bug】In FIM mode, an extra space is added at the beginning of the first line. #75

Closed liuzhenghua closed 5 months ago

liuzhenghua commented 5 months ago

Description

In FIM mode, an extra space is added at the beginning of the first generated line if the prefix (the text after <fim_prefix>) ends with \n.

How to repeat

from transformers import AutoTokenizer, AutoModelForCausalLM
# load model
device = "cuda" # the device to load the model onto

tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/CodeQwen1.5-7B", device_map="auto").eval()

input_text = """<fim_prefix>def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
<fim_suffix>
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)<fim_middle>"""

model_inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Use `max_new_tokens` to control the maximum output length.
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=False)[0]
# The generated_ids include prompt_ids, we only need to decode the tokens after prompt_ids.
output_text = tokenizer.decode(generated_ids[len(model_inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: {input_text}\n\nGenerated text: {output_text}")

When running the code above, make sure the prefix (the text after <fim_prefix>) ends with a newline, so that the completion starts on a blank line. If the prefix does not already end with the correct indentation for that line, an extra space is always added to the first line of the generated code.
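
For anyone trying to reproduce this with other inputs, here is a minimal sketch of how the prompt above is assembled and of the condition that seems to trigger the problem. The function names `build_fim_prompt` and `may_trigger_extra_space` are hypothetical helpers for illustration, not part of the model or of transformers; the three marker strings are the ones used in the prompt above.

```python
# Marker strings exactly as they appear in the reproduction prompt.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: join prefix and suffix into one FIM prompt string."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def may_trigger_extra_space(prefix: str) -> bool:
    """Hypothetical check for the reported condition.

    The issue is reported when the prefix ends with a newline, i.e. the
    completion has to start a fresh line with no indentation typed yet.
    """
    return prefix.endswith("\n")


prefix = "def quicksort(arr):\n    pivot = arr[len(arr) // 2]\n"
suffix = "\n    middle = [x for x in arr if x == pivot]"
prompt = build_fim_prompt(prefix, suffix)
print(may_trigger_extra_space(prefix))
```

A prefix that already ends with the indentation of the next line (for example `"...\n    "`) does not end with `\n`, so the check returns `False` for it.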

cyente commented 5 months ago

This does happen sometimes; we have also noticed that an extra space occasionally appears after <fim_middle> (not after the <fim_prefix> tokens). For now, a relatively convenient workaround is to handle it through pre-processing and post-processing, in particular to prevent indentation before the first word during inference.

liuzhenghua commented 5 months ago

An extra space after <fim_middle> would cause a different problem, one that could be solved by pre-processing. This issue can't be handled in pre-processing, because the content is written by users and we can't prevent them from starting a new line at the end of a line in their IDE. As a workaround, we currently drop one leading space in post-processing if the completion starts with a space, but this is not an ideal solution. If the prefix does not end with '\n', the problem does not occur.
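
The post-processing workaround described above could look roughly like the sketch below. This is an illustration, not the code we actually run: the function name `strip_extra_leading_space` is hypothetical, and restricting the fix to prefixes that end with a newline is an added assumption based on when the problem is reported to occur.

```python
def strip_extra_leading_space(prefix: str, completion: str) -> str:
    """Hypothetical workaround: drop one leading space from the completion.

    Assumption: the extra space only shows up when the prefix ends with a
    newline, so completions for other prefixes are returned unchanged.
    """
    if prefix.endswith("\n") and completion.startswith(" "):
        return completion[1:]
    return completion


# Example: the model returned five spaces where four were expected.
fixed = strip_extra_leading_space("def f(x):\n", "     return x\n")
print(repr(fixed))
```

This also shows why the workaround is not ideal: it cannot tell an unwanted extra space from a completion that legitimately starts with a single space on a new line.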

liuzhenghua commented 5 months ago

If the FIM prefix ends with the correct indentation, or does not end with \n, the problem does not occur.

cha0s commented 5 months ago

I'm getting this problem using Continue in vscode when trying to use this model. The problem does not occur with e.g. codellama.