Reproducing BigCodeBench Scores

terryyz commented 3 months ago

Hi there,

We're trying to reproduce the scores reported on BigCodeBench using v0.1.7post2. Since the HF tokenizer config does not include a chat template, I modified the code slightly to use the default chat template. So far, I get 49.0 on Complete and 38.9 on Instruct. The reproduced Instruct score is a bit lower than the one you reported, so I suspect you used a customized template during evaluation. Could you share more details about your setup?

https://github.com/bigcode-project/bigcodebench/issues/19

Cheers

xinpeng-zhang commented 3 months ago

We get 48.9 on Complete and 40.4 on Instruct, so I guess you are on the right track. The general chat template is detailed in the System Prompt Guideline. For BigCodeBench, we added a customized prompt but nothing more. You could follow the structure below:

<|system|>
You are an intelligent programming assistant named CodeGeeX. You will answer any questions users have about programming, coding, and computers, and provide code that is formatted correctly, executable, accurate, and secure, and offer detailed explanations when necessary.<|user|>
Calculates the average of the sums of absolute differences between each pair of consecutive numbers for all permutations of a given list. Each permutation is shuffled before calculating the differences. Args: - numbers (list): A list of numbers. Default is numbers from 1 to 10.
The function should output with:
    float: The average of the sums of absolute differences for each shuffled permutation of the list.
You should write self-contained code starting with:

import itertools
from random import shuffle
def task_func(numbers=list(range(1, 3))):

Complete the code according to the requirement.
<|assistant|>
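
For readers who want to assemble this structure programmatically, here is a minimal sketch. The function name `build_prompt` and the exact newline placement between segments are assumptions read off the layout above, not something confirmed by the CodeGeeX4 authors; only the special tokens, the system prompt, and the closing sentence come from the comment itself.

```python
# Sketch only: whitespace between segments is inferred from the layout
# shown in the thread and may differ from the authors' actual setup.
SYSTEM_EN = (
    "You are an intelligent programming assistant named CodeGeeX. "
    "You will answer any questions users have about programming, coding, "
    "and computers, and provide code that is formatted correctly, "
    "executable, accurate, and secure, and offer detailed explanations "
    "when necessary."
)
SUFFIX_EN = "Complete the code according to the requirement."


def build_prompt(instruction, system=SYSTEM_EN, suffix=SUFFIX_EN):
    """Join system prompt, task instruction, and suffix into one string."""
    return (
        f"<|system|>\n{system}"
        f"<|user|>\n{instruction}\n{suffix}\n"
        f"<|assistant|>\n"
    )


if __name__ == "__main__":
    # Hypothetical instruction, used only to show the resulting layout.
    print(build_prompt("Write a function that adds two numbers."))
```

Passing a different `system` and `suffix` lets the same helper produce other language variants of the prompt.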

terryyz commented 3 months ago

Thanks for the clarification! Would you mind adding the chat template to the HF model tokenizer config file? @xinpeng-zhang It seems that @Stanislas0 doesn't have the bandwidth to do that. Otherwise, we'll update the BigCodeBench leaderboard with the reproduced scores.

xinpeng-zhang commented 2 months ago

The structure I gave you was the English version. If you are going to update the leaderboard score, could you try the Chinese version? There is no need to change the HF model tokenizer; the Chinese system prompt is already in there. Here is the Chinese version. Please run it with greedy decoding; DO NOT use temperature or top_p sampling. Thank you!!!

<|system|>
你是一位智能编程助手，你叫CodeGeeX。你会为用户回答关于编程、代码、计算机方面的任何问题，并提供格式规范、可以执行 、准确安全的代码，并在必要时提供详细的解释。<|user|>
Calculates the average of the sums of absolute differences between each pair of consecutive numbers for all permutations of a given list. Each permutation is shuffled before calculating the differences. Args: - numbers (list): A list of numbers. Default is numbers from 1 to 10.
The function should output with:
    float: The average of the sums of absolute differences for each shuffled permutation of the list.
You should write self-contained code starting with:

import itertools
from random import shuffle
def task_func(numbers=list(range(1, 3))):

根据描述，完成代码
<|assistant|>

So if you are using the HF model, the default system prompt is in Chinese; all you need to do is add "\n根据描述，完成代码\n" (roughly, "Complete the code according to the description") after the instruction.
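
The two requirements in this comment (append the suffix, decode greedily) can be sketched as follows. The helper name `add_suffix` and the `max_new_tokens` value are assumptions for illustration. `do_sample=False` is the standard Hugging Face `transformers` generation setting that turns sampling off, which gives greedy decoding when the number of beams is left at 1.

```python
# Suffix quoted in the thread; appended after each BigCodeBench instruction.
SUFFIX_ZH = "\n根据描述，完成代码\n"


def add_suffix(instruction):
    """Return the instruction with the Chinese completion request appended."""
    return instruction + SUFFIX_ZH


# Keyword arguments intended for a transformers `generate` call.
# With do_sample=False, temperature and top_p play no role, so they are
# left out. max_new_tokens is an assumed value, not one from the thread.
GREEDY_KWARGS = {
    "do_sample": False,
    "max_new_tokens": 1024,
}

if __name__ == "__main__":
    print(repr(add_suffix("Write a function that adds two numbers.")))
    print(GREEDY_KWARGS)
```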

terryyz commented 2 months ago

Thanks @xinpeng-zhang! I'll get back to you when the new evaluation is done :)

terryyz commented 2 months ago

The reproduced score on Instruct is 40.0. I'll report this score since there is only a 0.4% difference.
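
As a quick check of the gap mentioned here, the snippet below compares the 40.4 reported earlier in the thread with the reproduced 40.0. The variable names are illustrative; both numbers come from the comments above.

```python
reported = 40.4    # Instruct score reported by the CodeGeeX4 authors
reproduced = 40.0  # Instruct score reproduced with the Chinese prompt

# Absolute gap in score points, and the same gap relative to the reported score.
abs_diff = round(reported - reproduced, 1)
rel_diff = round((reported - reproduced) / reported * 100, 2)

print(f"absolute gap: {abs_diff} points, relative gap: {rel_diff}%")
```

The 0.4 figure is therefore a gap in absolute score points; relative to the reported score it is just under 1%.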

Stanislas0 commented 2 months ago

The reproduced score on Instruct is 40.0. I'll report this score since there is only a 0.4% difference.

Hi Terry! I'm sorry for the late response. This score is very close to our experiments, and the difference might be due to some environment issues. It's ok to report the reproduced score. Thanks for your wonderful job on BigCodeBench!

terryyz commented 2 months ago

Thanks Qinkai!

The difference could also be due to the response content. I currently still use the English context to start the assistant's response part 🙂

Cheers, Terry


THUDM / CodeGeeX4

Reproducing BigCodeBench Scores #11