QwenLM / CodeQwen1.5

CodeQwen1.5 is the code version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

【Bug】In FIM mode, an extra space is added at the beginning of the first line. #75

Closed liuzhenghua closed 1 month ago

liuzhenghua commented 1 month ago

Description

In FIM mode, an extra space is added at the beginning of the first generated line if the prefix ends with `\n`.

How to repeat

from transformers import AutoTokenizer, AutoModelForCausalLM
# load model
device = "cuda" # the device to load the model onto

tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/CodeQwen1.5-7B", device_map="auto").eval()

input_text = """<fim_prefix>def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
<fim_suffix>
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)<fim_middle>"""

model_inputs = tokenizer([input_text], return_tensors="pt").to(device)

# Use `max_new_tokens` to control the maximum output length.
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=False)[0]
# The generated_ids include prompt_ids, we only need to decode the tokens after prompt_ids.
output_text = tokenizer.decode(generated_ids[len(model_inputs.input_ids[0]):], skip_special_tokens=True)

print(f"Prompt: {input_text}\n\nGenerated text: {output_text}")

When executing the above code, note that the prefix ends with a newline. In that case an extra space is always added to the first line of the generated code, making its indentation incorrect.
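To make the triggering condition concrete, the prefix portion of the prompt above (everything between `<fim_prefix>` and `<fim_suffix>`) ends with a newline, which is exactly the case that provokes the stray space:

```python
# The prefix section of the FIM prompt from the repro above.
prefix = """def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
"""

# This is the condition under which the extra leading space appears.
print(prefix.endswith("\n"))  # → True
```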

cyente commented 1 month ago

This situation does occur sometimes; we have also noticed an extra space sometimes appearing after `<fim_middle>` (not after the `<fim_prefix>` tokens). For now, a relatively convenient solution is to avoid this situation through pre-processing and post-processing, in particular by preventing an indent on the first word during inference.

liuzhenghua commented 1 month ago

An extra space after `<fim_middle>` would cause another problem that could be solved by pre-processing. This issue can't be handled during pre-processing, because the content is written by users and we can't control whether the user's IDE forbids a newline at the end of a line. Currently, as a workaround, we drop a leading space in post-processing if the completion starts with one, but this is not an ideal solution. If the prefix does not end with `\n`, the problem does not occur.
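The post-processing workaround described above can be sketched as follows (the function name is hypothetical; the fix assumes the bug only manifests when the prefix ends with a newline, per this thread):

```python
def strip_fim_artifact(prefix: str, middle: str) -> str:
    """Workaround sketch: if the FIM prefix ends with a newline and the
    generated middle starts with a blank space, drop exactly one leading
    space, since the model sometimes emits a spurious one in this case."""
    if prefix.endswith("\n") and middle.startswith(" "):
        return middle[1:]
    return middle

# The fix fires only in the problematic case:
print(strip_fim_artifact("pivot = arr[0]\n", "     left = []"))  # → "    left = []"
print(strip_fim_artifact("pivot = arr[0]", "    left = []"))     # → unchanged
```

Note this is lossy: a completion that legitimately begins with a space (rare after a newline, but possible) would lose one level of indentation, which is why it is a workaround rather than a fix.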

liuzhenghua commented 1 month ago

(screenshot omitted)

liuzhenghua commented 1 month ago

(screenshot omitted)

liuzhenghua commented 1 month ago

If the FIM prefix ends with the correct indentation, or does not end with `\n`, the problem does not occur. (screenshots omitted)

cha0s commented 1 month ago

I'm getting this problem when using this model with Continue in VS Code. The problem does not occur with, e.g., codellama.