bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.

A common interface for APIs and Models. #161

Open Anindyadeep opened 7 months ago

Anindyadeep commented 7 months ago

Summary of the issue

First of all, thanks for the awesome effort in building this code-evaluation package; I highly appreciate it. Right now, however, it only integrates with Hugging Face models. It would be great if we could run the same evaluations against closed-source models, for example with something like this:

accelerate launch  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

With the same interface and the post-processing logic of bigcode-evaluation-harness, we could then evaluate and compare code generation for open-source and closed-source models alike.
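
Under the hood, the generation side could be a thin wrapper around the provider's client. Here is a rough sketch of what I have in mind (the helper name is mine and not part of the harness), using the openai Python client:

# Rough sketch of API-backed generation (hypothetical helper, not harness code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_api_model(prompts, model="gpt-3.5-turbo", n_samples=10,
                       temperature=0.2, max_tokens=512):
    """Return n_samples generations per prompt, as a list of lists of strings,
    mirroring the layout the harness uses for generations."""
    generations = []
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
            n=n_samples,
        )
        generations.append([choice.message.content for choice in resp.choices])
    return generations

The idea is that the harness' existing task post-processing would then run on these strings exactly as it does for Hugging Face generations.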

What is the motivation?

Open-source models are great, but researchers and LLM enthusiasts keep trying to build models that surpass GPT with fewer parameters and better performance on certain tasks. Being able to compare against closed-source baselines in the same library would be really helpful.

How can I contribute:

I already have most of this code ready, and if you are aligned with the motivation of this issue I can open a PR. The problem I am facing is that the evaluation scores for API-based models come out very low: for example, gpt-3.5-turbo gets a score of 0.006 on the HumanEval benchmark even though its generations look correct. The issue is the indentation and post-processing of the generations. For example, one generation from gpt-3.5-turbo looks like this:

from typing import List

def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """    paren_string = paren_string.replace(' ', '')
    stack = []
    result = []
    group = ''
    for char in paren_string:
        if char == '(':
            if stack:
                group += char
            stack.append(char)
        elif char == ')':
            stack.pop()
            group += char
            if not stack:
                result.append(group)
                group = ''
    return result

As the code above shows, the problem is the indentation: at evaluation time the solution gets marked as wrong even though the logic is correct. I tried applying bigcode's post-processing for the different tasks, but it was not working, so I would highly appreciate some help there.
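
What I have been trying so far is roughly the following post-processing (just a sketch of my attempt, not the harness' own logic): strip any markdown fences, drop the repeated prompt if the model re-emits the whole function, and make sure the completion starts on its own line so it is not glued to the closing docstring quotes.

def extract_completion(prompt: str, generation: str) -> str:
    gen = generation
    # Chat models often wrap code in markdown fences; keep only the code part.
    if "```python" in gen:
        gen = gen.split("```python", 1)[1].split("```", 1)[0]
    elif "```" in gen:
        gen = gen.split("```", 1)[1].split("```", 1)[0]
    # If the model re-emitted the whole prompt, keep only what comes after it.
    if prompt.strip() and prompt.strip() in gen:
        gen = gen.split(prompt.strip(), 1)[1]
    # Start the completion on a new line so it is not appended directly after
    # the closing triple quote of the docstring (the indentation bug above).
    if not gen.startswith("\n"):
        gen = "\n" + gen
    return gen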

loubnabnl commented 7 months ago

Hi, thanks for the suggestion. Some challenges with evaluating these models are that they can change and evolve behind the API, which makes the evaluation numbers less relevant over time, and that they may require different post-processing to extract the code snippet, since they tend to generate natural text before and after the code. So I'm not sure the current approach will work out of the box for most tasks.

However, if you run tests and find that your implementation works and matches public numbers for certain tasks like instruct-humaneval or others like HumanEvalSynthesize, feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

Regarding your indentation issue, I think the prompt is stripped by default and doesn't have a \n at the end in instruct-humaneval. For the humaneval task we have both humaneval and humaneval-unstripped, because we've noticed that GPT-4's tokenizer and a few others like Phind require keeping the last \n in the prompt to work properly. Can you try the evaluation again while adding the \n? You can do that here or in the context of the task.
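
Roughly, keeping the newline amounts to something like this in the task's get_prompt (just a sketch, the actual task code may look a bit different):

def get_prompt(self, doc):
    # default behaviour strips the prompt, i.e. doc["prompt"].strip();
    # keep a single trailing \n instead so the model continues the function body
    return doc["prompt"].rstrip() + "\n"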

Anindyadeep commented 7 months ago

I tried some of the things mentioned above, but everything was solved simply by adding a prompt. Does that count as a valid solution? For example, for HumanEval the problem went away when I added this prompt:

# make an instruction
instruction_prompt = """
Complete the given code. First write whatever is given to you, followed by just completing the rest.
Ensure you have wrote the full function. Do not Write anything else other than completing the function.\n
"""

The model I used was gpt-3.5-turbo, on HumanEval.
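
Concretely, each request just pairs this instruction with the raw HumanEval prompt, along these lines (a sketch; the message layout is my choice, not something from the harness):

def build_messages(instruction_prompt: str, humaneval_prompt: str) -> list:
    # The instruction goes in the system message, the unmodified HumanEval
    # prompt in the user message.
    return [
        {"role": "system", "content": instruction_prompt},
        {"role": "user", "content": humaneval_prompt},
    ]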

loubnabnl commented 7 months ago

Maybe check the code that the OctoCoder authors submitted for evaluating OpenAI models on HumanEvalSynthesize: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack_openai.py

Anindyadeep commented 7 months ago

However, if you run tests and find that your implementation works and matches public numbers

A very interesting and weird thing: I used gpt-3.5-turbo with deterministic generation and pass@1 on HumanEval with the prompt above, and it got a score of 0.62, whereas the CodeLlama and Mistral papers report 48.1.

But just using prompt + \n (without the instruction above) got me a result of 0.0016.

One reason could be that gpt-3.5 has evolved between the time CodeLlama ran its evaluation and the time I am evaluating. I am also not sure whether it has been contaminated with the same examples.
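
For reference, the numbers above are pass@1 as a fraction (0.62 here corresponds to 62%, versus the 48.1% reported in the papers), and if I understand correctly they are computed with the usual unbiased pass@k estimator from the Codex paper; a minimal sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them passed the unit tests."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))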

ALLISWELL8 commented 7 months ago

@Anindyadeep Can you open-source your project?

Anindyadeep commented 7 months ago

@Anindyadeep Can you open-source your project?

Yeah, we will do that shortly :)

Anindyadeep commented 7 months ago

However, if you run tests and find that your implementation works and matches public numbers for certain tasks like instruct-humaneval or others like HumanEvalSynthesize, feel free to open a PR and we can consider adding this setup for a restricted set of benchmarks if it integrates well with the codebase.

@loubnabnl I did not check instruct-humaneval, but I did check humaneval, and the results were higher than those from the CodeLlama implementation and very similar to the latest DeepSeek Coder.

So, could you confirm whether the interface below is okay before I put up the PR? Feel free to suggest any changes.

python3  main.py \
  --model gpt-3.5-turbo \
  --max_length_generation 512 \
  --tasks instruct-humaneval \
  --instruction_tokens <user_token>,<end_token>,<assistant_token> \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 10 \
  --allow_code_execution

loubnabnl commented 7 months ago

Yes, feel free to open a PR and add the scores you got.

Anindyadeep commented 6 months ago

Hi @loubnabnl, I started a PR. Let me know which benchmarks I need to evaluate through this so that I can add the results too.

Thanks