bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Try to reproduce the leaderboard result of codellama-instruct-34B for Rust #172

Closed: teamplayer3 closed this issue 6 months ago

teamplayer3 commented 7 months ago

I am trying to reproduce the CodeLlama-Instruct-34B result from the leaderboard for Rust. I get a value of around 25% (HumanEvalPack instructions with the Code Llama instruct prompt template), but the leaderboard says 39%.

It is not clear to me what exact prompt you used. From the description of the leaderboard, I can't tell whether you use the instructions from HumanEvalPack, only the prompts from MultiPL-E for instruction models, or a prompt template for Code Llama. In general, I can't find the Code Llama prompt template in your framework.

It would be nice if you could give me an example of the prompt you feed into the model when evaluating CodeLlama-Instruct-34B for Rust.

Thank you.

loubnabnl commented 7 months ago

Hi, we only use HumanEvalPack for instruction-tuned models for Python; for the rest of the 12 languages we use MultiPL-E, as mentioned in the About section of the leaderboard.

[Screenshot of the leaderboard's About section describing the evaluation prompts]
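To see what those MultiPL-E prompts actually look like, here is a minimal sketch (assuming the public `nuprl/MultiPL-E` dataset on the Hugging Face Hub; the config and field names are taken from the dataset card and may differ between versions):

```python
# Minimal sketch: inspect the raw MultiPL-E Rust prompts.
# Assumes the public nuprl/MultiPL-E dataset with a "humaneval-rs" config;
# field names ("name", "prompt", "stop_tokens") may vary between dataset versions.
from datasets import load_dataset

ds = load_dataset("nuprl/MultiPL-E", "humaneval-rs", split="test")

example = ds[0]
print(example["name"])         # task id, e.g. "HumanEval_0_has_close_elements"
print(example["prompt"])       # Rust doc comment + fn signature
print(example["stop_tokens"])  # strings used to cut off the generation
```

If this dataset matches what the leaderboard uses, the `prompt` field is what the model sees for Rust, with no extra template around it.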
teamplayer3 commented 6 months ago

Thanks for responding.

But I understand your description a bit differently. You say you precompute the HumanEval-Python tasks for the instruction models to fit their prompt style. But in the next sentence you say you use the instructions from HumanEvalPack for the instruction models. An example of such an instruction would be:

Write a Rust function `has_close_elements(numbers:Vec<f32>, threshold: f32) -> bool` to solve the following problem: Check if in given list of numbers, are any two numbers closer to each other than given threshold.

This does not align with your graphic of the prompt for instruction models. What does the instruction version look like?

And again, there is no prompt template for CodeLlama-Instruct in the framework.

loubnabnl commented 6 months ago

In the beginning, all models were evaluated on HumanEval for Python and MultiPL-E for the other programming languages. We then switched the Python evaluation for instruction models from HumanEval to HumanEvalSynthesize-Python (using this prompt for CodeLlama): https://github.com/bigcode-project/bigcode-evaluation-harness/blob/be2a44c2faa29c20b5041d7083acb698eb373309/bigcode_eval/tasks/humanevalpack.py#L226

But we kept the MultiPL-E prompts for all the other programming languages, for both base and instruction models. Maybe in the future we'll move to HumanEvalPack for the other languages too. Does that answer your question? EDIT: it seems the prompts link in the About section was outdated; I've updated it here.
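To make the difference concrete, here is a rough sketch of the two prompt styles (illustration only: the `[INST] ... [/INST]` wrapper is the generic Code Llama / Llama 2 instruct chat format and the helper names are made up; the authoritative template is at the humanevalpack.py line linked above):

```python
# Illustration only, not the harness's code. The [INST] ... [/INST] wrapper is the
# generic Code Llama / Llama 2 instruct chat format; the exact template the harness
# uses for HumanEvalSynthesize-Python lives in bigcode_eval/tasks/humanevalpack.py
# (see the link above). Helper names here are made up for the example.

def instruct_prompt_python(instruction: str, context: str = "") -> str:
    """HumanEvalSynthesize-Python style: wrap the natural-language instruction
    in an instruct chat template (Python, instruction-tuned models only)."""
    return f"[INST] {instruction.strip()} [/INST] {context}"

def multipl_e_prompt_rust(raw_prompt: str) -> str:
    """MultiPL-E style (Rust and the other languages, base and instruct models):
    the doc comment + fn signature from the dataset is passed to the model
    verbatim, with no chat template around it."""
    return raw_prompt
```

So for Rust there is nothing wrapped around the MultiPL-E prompt; the model simply continues the doc comment and function signature.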

teamplayer3 commented 6 months ago

Thanks for clarifying. So, for example, for the Rust programming language, you call CodeLlama without the prompt template, only with the plain prompt from the MultiPL-E dataset?

If I prompt CodeLlama with a plain prompt, without a prompt template, I often get a result like the following.

Prompt

/// Concatenate vector of strings into a single string
/// >>> concatenate(vec![])
/// String::from("")
/// >>> concatenate(vec![String::from("a"), String::from("b"), String::from("c")])
/// String::from("abc")
fn concatenate(strings: Vec<String>) -> String {

Output

assert_eq!(concatenate(vec![]), String::from(""));
        assert_eq!(
            concatenate(vec![String::from("a"), String::from("b"), String::from("c")]),
            String::from("abc")
        );
    }
fn test_concatenate() {
        assert_eq!(concatenate(vec![]), String::from(""));
        assert_eq!(
            concatenate(vec![String::from("a"), String::from("b"), String::from("c")]),
            String::from("abc")
        );
    }
loubnabnl commented 6 months ago

> So, for example, for the Rust programming language, you call CodeLlama without the prompt template, only with the plain prompt from the MultiPL-E dataset?

That's correct! Did you try running the full benchmark to see what score you get?

teamplayer3 commented 6 months ago

Yes, I ran the full benchmark with CodeLlama-Instruct-7B for now. I had a problem with parsing the output in the previous Output I showed; now I get a score of 25, which aligns with your score. Thanks for helping.
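For anyone hitting the same output-parsing problem: below is a minimal sketch of the stop-token truncation that MultiPL-E-style evaluation relies on (an illustration under assumptions, not the harness's exact post-processing; the helper name and the example stop token are made up, and real stop tokens come from the dataset).

```python
# Illustration only: truncate a raw completion at the earliest stop token, the way
# MultiPL-E-style evaluation cuts generations off before assembling the test program.
# The helper name and the example stop token below are assumptions, not the
# harness's actual implementation.

def truncate_at_stop_tokens(completion: str, stop_tokens: list[str]) -> str:
    # Cut the generated text at the first occurrence of any stop token.
    cut = len(completion)
    for tok in stop_tokens:
        idx = completion.find(tok)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Hypothetical raw generation for the concatenate example above.
raw = "    strings.concat()\n}\nfn test_concatenate() {\n    // ...\n}"
print(truncate_at_stop_tokens(raw, ["\n}"]))  # prints only the function body
```

Without this kind of truncation, the extra test code the model appends (as in the output shown earlier) ends up in the program that gets compiled and scored.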