After fine-tuning, running HumanEval with the repo's evaluation code reports Failed to extract code block with error `list index out of range`:

deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself

https://coder.deepseek.com/

MIT License


After fine-tuning, running HumanEval with the repo's evaluation code reports Failed to extract code block with error `list index out of range`: #111

Closed mst272 closed 4 months ago

mst272 commented 5 months ago

>>> Task: Python/2
>>> Output:
def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    and integer part (largest integer smaller than given number) and decimals
    (leftover part always smaller than 1).

    Return the decimal part of the number.
    >>> truncate_number(3.5)
    0.5
    """
    return number - int(number)
<|EOT|>
Failed to extract code block with error `list index out of range`:

What causes `Failed to extract code block with error list index out of range`?

guoday commented 5 months ago

Newer versions of transformers changed the interface. Please make sure your local deepseek-coder repo is up to date. https://github.com/deepseek-ai/DeepSeek-Coder/commit/6590983bf05aa05fe61a9360f5d50360ad84980f

xiaokuixk commented 4 months ago

Hi, I downloaded the latest repo, but the same problem still occurs.

DejianYang commented 4 months ago

1) Are your results normal? If only a few cases report this error, it is just a debug message and does not mean the evaluation results are wrong.
2) Print your prompt before calling generate.
3) Please share your transformers and tokenizers versions so we can check whether a version incompatibility is the cause.
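
For step 3, a minimal sketch of one way to collect the two version strings, assuming the packages were installed with pip under the distribution names `transformers` and `tokenizers`. The helper name `installed_versions` is made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "tokenizers")):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Report missing packages instead of crashing the diagnostic.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Pasting this output into the issue gives maintainers the exact versions without guessing from the environment.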

mst272 commented 4 months ago

This problem seems to be caused by the code block extraction failing. Normally the model output is wrapped in a ```python ... ``` code fence. If the output does not have one, you need to add it yourself in post-processing.
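
A minimal sketch of where the message can come from, assuming the extraction step takes the first match of a fence regex, as in the `extract_generation_code` function quoted later in this thread. The helper name `first_code_block` is made up for this illustration, and `FENCE` is only a way to write three backticks inside the example.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def first_code_block(output: str, lang: str = "python") -> str:
    """Return the text of the first fenced block for the given language."""
    pattern = f"{FENCE}{lang}(.*?){FENCE}"
    # findall returns [] when no fence is present, so [0] raises IndexError.
    return re.findall(pattern, output, re.DOTALL | re.IGNORECASE)[0]


fenced = f"{FENCE}python\nreturn number - int(number)\n{FENCE}"
bare = "return number - int(number)\n"

print(repr(first_code_block(fenced)))

try:
    first_code_block(bare)
except IndexError as err:
    print(f"Failed to extract code block with error `{err}`")
```

The second call prints the same `list index out of range` text as the log above, because indexing an empty match list is what fails, not the generated code itself.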

DejianYang commented 4 months ago

Right. So if your final results are normal, you can ignore it.

xiaokuixk commented 4 months ago

Thanks for your replies! I just tried it:

The test results are not normal: pass@1 = 0, and every case reports the error.
The prompt and the output that follows it are:

Please continue to complete the function. You are not allowed to modify the given code and do the completion only. Please return all completed function in a codeblock. Here is the given code to do completion:
```
import java.util.*;
import java.lang.reflect.*;
import org.javatuples.*;
import java.security.*;
import java.math.*;
import java.io.*;
import java.util.stream.*;
class Problem {
    // Return length of given string
    // >>> stringLength((""))
    // (0l)
    // >>> stringLength(("abc"))
    // (3l)
    public static long strlen(String string) {
```
Failed to extract code block with error `list index out of range`:

Task: HumanEval_23_strlen
Output: return string.length(); }

transformers 4.37.2, tokenizers 0.15.1

DejianYang commented 4 months ago

The input prompt should have the following format:

You are an AI programming assistant, utilizing the DeepSeek Coder model, developed by DeepSeek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.
### Instruction:
{Question}
### Response:

Please use the code below so that the generated input prompt matches the format above.

tokenizer.apply_chat_template([{'role': 'user', 'content': prompt }], tokenize=False, add_generation_prompt=True)

Also, does your Python test work, or are all languages failing? And is the model you are using deepseek-coder (not deepseek-llm)?
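
A minimal sketch for checking a printed prompt against the format above without loading a tokenizer. The exact whitespace and any special tokens are produced by the tokenizer's chat template and are not reproduced here; `render_prompt` and `looks_like_chat_prompt` are hypothetical helper names used only for this illustration.

```python
SYSTEM_PROMPT = (
    "You are an AI programming assistant, utilizing the DeepSeek Coder model, "
    "developed by DeepSeek Company, and you only answer questions related to "
    "computer science. For politically sensitive questions, security and "
    "privacy issues, and other non-computer science questions, you will "
    "refuse to answer."
)


def render_prompt(question: str) -> str:
    """Approximate the expected layout; whitespace here is an assumption."""
    return f"{SYSTEM_PROMPT}\n### Instruction:\n{question}\n### Response:\n"


def looks_like_chat_prompt(text: str) -> bool:
    """Check for the two section markers in the order the format requires."""
    has_instruction = "### Instruction:" in text
    ends_with_response = text.rstrip().endswith("### Response:")
    return has_instruction and ends_with_response


raw = "Please continue to complete the function."
print(looks_like_chat_prompt(raw))                  # False
print(looks_like_chat_prompt(render_prompt(raw)))   # True
```

Running the check on the string printed before `generate` shows quickly whether the chat template was applied or the bare instruction was sent to the model.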

xiaokuixk commented 4 months ago

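The prompt is the one in https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/Evaluation/HumanEval/eval_instruct.py, which I have not modified. Do I need to change it to the one you posted? The model I am using is a fine-tuned deepseek-ai/deepseek-coder-6.7b-instruct. I am trying Python now, and so far it appears to report the same error.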

DejianYang commented 4 months ago

pass@1 = 0 must be caused by an incorrect format. Please confirm that your code is up to date, especially the following code in eval_instruct.py:

def generate_one(example, lang, tokenizer, model):
    prompt = build_deepseekcoder_instruction(languge_settings[lang]['full_name'], example['prompt'])
    inputs = tokenizer.apply_chat_template(
        [{'role': 'user', 'content': prompt }],
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    stop_id = tokenizer.convert_tokens_to_ids("<|EOT|>")
    assert isinstance(stop_id, int), "Invalid tokenizer, EOT id not found"

    outputs = model.generate(
        inputs,
        max_new_tokens=1024,
        do_sample=False,
        # top_p=0.95,
        # temperature=temperature,
        pad_token_id=stop_id,
        eos_token_id=stop_id
    )

    output = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    example['output'] = output

    return extract_generation_code(example, lang_code=lang)

You can print the input that is fed to the model:

print(tokenizer.apply_chat_template([{'role': 'user', 'content': prompt }], tokenize=False, add_generation_prompt=True))

mst272 commented 4 months ago

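I have already explained the cause; see my earlier reply. Look at the output field in your generated JSON file: the code is not wrapped in ```python ... ``` fence markers, so it cannot be extracted during evaluation, which is why pass@1 is 0.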
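
A minimal sketch of that check, assuming the generated file is JSON Lines with `task_id` and `output` fields (the field names used by the evaluation code in this thread; the file layout itself is an assumption). The helper name `unfenced_task_ids` is made up for this illustration.

```python
import json

FENCE = "`" * 3  # three backticks


def unfenced_task_ids(jsonl_text: str, lang: str = "python"):
    """List the task ids whose output has no opening fence for the language."""
    marker = f"{FENCE}{lang}".lower()
    missing = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        output = record.get("output") or ""
        if marker not in output.lower():
            missing.append(record.get("task_id"))
    return missing


sample = "\n".join([
    json.dumps({"task_id": "Python/1", "output": f"{FENCE}python\npass\n{FENCE}"}),
    json.dumps({"task_id": "Python/2", "output": "return number - int(number)"}),
])
print(unfenced_task_ids(sample))  # ['Python/2']
```

If every task id shows up in the result, the fine-tuned model has stopped emitting fences altogether rather than failing on a few cases.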

mst272 commented 4 months ago

I am closing this issue now.

xiaokuixk commented 4 months ago

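Wait a moment. The Python HumanEval run I started just finished. Its output has no code block either, yet the evaluation result is correct and pass@1 is not 0. I think we need to confirm whether this is really what breaks Java HumanEval for the fine-tuned model. Have you run Java, and did you see the same problem?

Also, I checked the code in eval_instruct.py. It is up to date and contains the line that @DejianYang put in bold. The printed inputs look like this: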

<｜begin▁of▁sentence｜>You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer

### Instruction:

Please continue to complete the function. You are not allowed to modify the given code and do the completion only. Please return all completed function in a codeblock. Here is the given code to do completion:

import java.util.*;
import java.lang.reflect.*;
import org.javatuples.*;
import java.security.*;
import java.math.*;
import java.io.*;
import java.util.stream.*;
class Problem {
    // Return length of given string
    // >>> stringLength((""))
    // (0l)
    // >>> stringLength(("abc"))
    // (3l)
    public static long strlen(String string) {

### Response:

Below are the error message and the output:

Failed to extract code block with error `list index out of range`:

Task: HumanEval_23_strlen
Output: return string.length(); }
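
One possible reading of the Python result reported above, offered as a sketch rather than a confirmed diagnosis: when extraction fails, the `except` branch of `extract_generation_code` (quoted later in this thread) falls back to `example['prompt'] + '\n' + output`. If the Python output repeats the whole function, as in the first post, that concatenation is still a valid program. The prompt and output strings below are shortened stand-ins, not the real benchmark data.

```python
# Shortened stand-in for a HumanEval Python prompt (signature plus docstring).
prompt = (
    "def truncate_number(number: float) -> float:\n"
    '    """Return the decimal part of the number."""\n'
)

# Unfenced model output that repeats the full function definition.
output = (
    "def truncate_number(number: float) -> float:\n"
    "    return number - int(number)\n"
)

# Same concatenation as the fallback branch when no fenced block is found.
generation = prompt + "\n" + output

namespace = {}
exec(generation, namespace)  # the second definition replaces the docstring-only stub
print(namespace["truncate_number"](3.5))  # 0.5
```

Whether the same fallback yields compilable Java depends on how a bare body such as `return string.length(); }` lines up with the braces the test harness appends, which this thread does not show.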

RoacherM commented 3 months ago

Modify the extract_generation_code part of Evaluation/Humaneval/utils/utils.py and add a wrapper.

def extract_generation_code(example: str, lang_code: str, verbose: bool=False):
    task_id = example['task_id']
    output = example.get('output', example.get("gpt_completion"))
    question = example["prompt"].strip()
    setting = languge_settings[lang_code]
    lang = setting['full_name']
    indent = setting['indent']

    # Check whether the output contains a fenced code block such as ```python
    if f'```{lang.lower()}' not in output:
        # No code block found, so add the fence markers manually
        output = f'```{lang.lower()}\n{output}\n```'

    try:
        code_block: str = re.findall(f'```{lang.lower()}(.*?)```', output, re.DOTALL | re.IGNORECASE)[0]
        if verbose:
            print(">>> Task: {}\n{}".format(task_id, code_block))

        # Remove the main function
        if setting.get('main', None) and setting['main'] in code_block:
            main_start = code_block.index(setting['main'])
            code_block = code_block[:main_start]

        func_name, func_prefix = get_function_name(question, lang)

        try:
            start = code_block.lower().index(func_name.lower())
            indent = 0
            while start - indent >= 0 and code_block[start - indent-1] == ' ':
                indent += 1

            try:
                end = code_block.rindex('\n' + ' '*indent + '}')
            except:
                end = len(code_block)
        except:
            start = 0
            try:
                end = code_block.rindex('\n' + ' '*indent + '}')
            except:
                end = len(code_block)

        body = code_block[start:end]

        if lang_code.lower() in ['php', 'ts', 'js']:
            body += '\n' + ' '*indent + '}'

        generation = func_prefix + '\n' + body + '\n'
        example['generation'] = generation

    except Exception as ex:
        print("Failed to extract code block with error `{}`:\n>>> Task: {}\n>>> Output:\n{}".format(
            ex, task_id, output
        ))
        example['generation'] = example['prompt'] + '\n' + output

    return example
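
A minimal sketch that isolates the wrapper step from the function above, so its effect can be checked without the rest of the evaluation harness. The helper names `ensure_fenced` and `first_code_block` are made up for this illustration, and `FENCE` is only a way to write three backticks inside the example.

```python
import re

FENCE = "`" * 3  # three backticks


def ensure_fenced(output: str, lang: str) -> str:
    """Wrap the output in a fence for the language unless one is already present."""
    opener = f"{FENCE}{lang.lower()}"
    if opener not in output:
        return f"{opener}\n{output}\n{FENCE}"
    return output


def first_code_block(output: str, lang: str) -> str:
    """Return the text of the first fenced block for the language."""
    pattern = f"{FENCE}{lang.lower()}(.*?){FENCE}"
    return re.findall(pattern, output, re.DOTALL | re.IGNORECASE)[0]


bare = "return string.length(); }"
wrapped = ensure_fenced(bare, "Java")
print(repr(first_code_block(wrapped, "Java")))

# Output that is already fenced is left untouched.
print(ensure_fenced(wrapped, "Java") == wrapped)  # True
```

With the wrapper in place the regex always finds a block, so the `list index out of range` message disappears. Whether the extracted text then passes the tests still depends on the later slicing logic and on the generated code.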