bigcode-project / bigcode-evaluation-harness

A framework for the evaluation of autoregressive code generation language models.
Apache License 2.0

Fix LLaMA Evaluations #81

Closed sedrickkeh closed 1 year ago

sedrickkeh commented 1 year ago

The current evaluation pipeline returns all-zero scores for LLaMA (#74). Upon inspecting the generated outputs, it seems that the model is generating incoherent tokens.

There are two reasons why LLaMA fails:

1. Padding

For models such as StarCoder, padding is done on the right, but for LLaMA it is done on the left. This breaks generation because the way the inputs are parsed assumes right padding: input_ids=batch["ids"][:, : batch["input_len"]] (https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/utils.py#L136). With left padding, this slice truncates the actual prompt instead of the padding tokens.
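
A toy illustration of the slice (a sketch only; the ids and pad value below are made up, not taken from the harness):

import torch

pad_id = 0
prompt_ids = [5, 6, 7]                                # input_len == 3

right_padded = torch.tensor([prompt_ids + [pad_id, pad_id]])
left_padded = torch.tensor([[pad_id, pad_id] + prompt_ids])

input_len = len(prompt_ids)
print(right_padded[:, :input_len])   # tensor([[5, 6, 7]]) -> prompt kept, padding dropped
print(left_padded[:, :input_len])    # tensor([[0, 0, 5]]) -> padding kept, prompt truncated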

Solution: Explicitly enforce padding_side="right" in the tokenizer.
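
For example, a minimal sketch of the tokenizer setup (using the checkpoint mentioned later in this thread):

from transformers import AutoTokenizer

# Force right padding regardless of the tokenizer's default
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", padding_side="right")
# or, equivalently, after loading:
tokenizer.padding_side = "right"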

Result: The model no longer generates incoherent tokens. However, it still scores 0.0. Upon inspection, I found that there was always an extra comment marker """ before each generation. For example:

def func(x):
    """ Some comment here
    Comment line 2
    """ """
    return x

Why is this? See the point below.

2. bos_token (<s>)

from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
llama_tokenizer("hi")
>>> {'input_ids': [1, 7251], ...}   # the bos_token (id 1) is prepended

starcoder_tokenizer = AutoTokenizer.from_pretrained('bigcode/starcoder')
starcoder_tokenizer("hi")
>>> {'input_ids': [4980], ...}   # no bos_token

This causes a problem because decoding goes through the skip_special_tokens=False branch (https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/utils.py#L182-L185), so the decoded text starts with the literal string <s>. In the postprocessing line generation = generation[len(prompt) :], the slice is then 3 characters (the length of <s>) short, which leaves the tail of the prompt, i.e. the extra comment marker, at the start of the generation.
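
Schematically (plain Python strings, not the actual harness code), the off-by-three slice looks like this:

# Schematic only: the decoded text keeps the bos token as literal text
prompt = 'def func(x):\n    """ Some comment here\n    """\n'
completion = "    return x\n"

decoded = "<s>" + prompt + completion    # what decoding yields with skip_special_tokens=False
generation = decoded[len(prompt):]       # slice is len("<s>") == 3 characters short
print(repr(generation))
>>> '""\n    return x\n'   # part of the prompt's closing """ leaks into the generation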

Solution: There are several possible ways to fix this. Options include removing the bos_token on the tokenizer side, or skipping it at decoding time (skip_special_tokens=True) when a bos_token is present. I will defer to the maintainers to decide.
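
For instance, the tokenizer-side option could look roughly like this (a sketch only; it assumes the tokenizer exposes the add_bos_token flag, which the HF LLaMA tokenizers do). The decode-time variant is essentially what the next comment proposes.

from transformers import AutoTokenizer

# Sketch of the tokenizer-side fix: never prepend <s> when encoding prompts
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_bos_token=False)
print(tokenizer("hi")["input_ids"])
>>> [7251]   # no leading bos id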

Result: After these fixes, LLaMA's HumanEval pass@1 is around the ~0.1 level, which matches the expected performance.

sedrickkeh commented 1 year ago

For reference, I am using huggyllama/llama-7b, but the issues I described mostly involve the HF LlamaTokenizer, so any LLaMA-based model that uses this tokenizer likely faces the same issues.

loubnabnl commented 1 year ago

Thanks for diving into this and for the excellent explanation! I suggest we add a condition to make sure no bos_token is kept after decoding. We can add something like this here:

            if INFILL_MODE or tokenizer.eos_token in task.stop_words:
                # remove bos_token if it was added
                if s[0] == tokenizer.bos_token_id:
                    s = s[1:]
                gen_code = tokenizer.decode(
                    s, skip_special_tokens=False, clean_up_tokenization_spaces=False
                )
sedrickkeh commented 1 year ago

Great, ok! I've made the changes accordingly.

mnoukhov commented 1 year ago

Thanks for the fix @sedrickkeh !