Closed sedrickkeh closed 1 year ago
For reference, I am using huggyllama/llama-7b, although the issues I mentioned mostly involve the LlamaTokenizer in HF, so any LLaMA-based model that uses this tokenizer likely faces the same issues.
Thanks for diving into this and for the excellent explanation! I suggest we add a condition to make sure no bos_token is kept after decoding. We can add something like this here:

```python
if INFILL_MODE or tokenizer.eos_token in task.stop_words:
    # remove bos_token if it was added
    if s[0] == tokenizer.bos_token_id:
        s = s[1:]
    gen_code = tokenizer.decode(
        s, skip_special_tokens=False, clean_up_tokenization_spaces=False
    )
```
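The stripping logic can be sanity-checked in isolation with a stub tokenizer (the `StubTokenizer` below is a hypothetical stand-in for illustration, not the HF class):

```python
# Minimal sketch of the bos-stripping logic above, using a made-up
# stub tokenizer rather than the real LlamaTokenizer.
class StubTokenizer:
    bos_token_id = 1  # LLaMA's bos token id happens to be 1 as well

    def decode(self, ids):
        # Stand-in decode: just join the token ids as text.
        return " ".join(str(i) for i in ids)


tokenizer = StubTokenizer()
s = [1, 42, 43, 44]  # generated sequence that starts with the bos token id

# remove bos_token if it was added
if s[0] == tokenizer.bos_token_id:
    s = s[1:]

gen_code = tokenizer.decode(s)
print(gen_code)  # the leading bos id is gone from the decoded text
```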
Great, ok! I've made the changes accordingly.
Thanks for the fix @sedrickkeh !
The current evaluation pipeline returns all 0.0 for LLaMA (#74). Upon inspecting the generated outputs, it seems that the model is generating incoherent tokens.
There are two reasons why LLaMA fails:
1. Padding

For models such as StarCoder, padding is done on the right. For LLaMA, however, padding is done on the left. This largely breaks generation, because the way the inputs are parsed assumes right-padding:

```python
input_ids=batch["ids"][:, : batch["input_len"]]
```

(https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/utils.py#L136). With left padding, this slice truncates the actual prompt rather than the padding tokens.

Solution: Explicitly enforce `padding_side="right"` in the tokenizer.

Result: The model no longer generates incoherent tokens. However, it still scores 0.0. Upon inspection, I found that this was because there would always be an extra comment marker `"""` before each generation. Why is this? See the point below.
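The truncation bug can be reproduced with plain lists (the token ids and pad id below are made up for illustration):

```python
pad = 0
prompt = [5, 6, 7]  # three real prompt tokens
max_len = 6
input_len = len(prompt)

# Right padding (StarCoder-style): pads go after the prompt,
# so slicing [:input_len] keeps exactly the prompt.
right_padded = prompt + [pad] * (max_len - input_len)
print(right_padded[:input_len])  # [5, 6, 7]

# Left padding (LLaMA-style): pads go before the prompt,
# so the same slice keeps the pads and drops the prompt.
left_padded = [pad] * (max_len - input_len) + prompt
print(left_padded[:input_len])  # [0, 0, 0] -- the prompt was truncated
```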
2. bos_token (`<s>`)

The LlamaTokenizer prepends a bos_token (`<s>`) to the input. This causes a problem because during generation, it enters the `skip_special_tokens=False` conditional (https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/lm_eval/utils.py#L182-L185), which means that in the postprocessing line

```python
generation = generation[len(prompt) :]
```

`len(prompt)` will be 3 characters short (the length of `<s>`), thus leaving the extra comment markers at the start of the generation.

Solution: There are potentially many ways to fix this. Some options: fix it at the start (remove bos_tokens on the tokenizer side), fix it at decoding time (`skip_special_tokens=True` if there is a BOS token), etc. I will defer to the mods to decide on this.

Result: After fixing, I am able to get LLaMA performance on HumanEval pass@1 to around the ~0.1 level, which matches what is expected.
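The off-by-three slice can be reproduced with plain strings (the prompt below is a made-up HumanEval-style example):

```python
# Sketch of the postprocessing bug: the decoded generation still carries
# the 3-character "<s>" prefix, but the prompt used for slicing does not.
prompt = 'def add(a, b):\n    """'
completion = "\n    return a + b"

# With skip_special_tokens=False, the decoded text keeps the bos token:
generation = "<s>" + prompt + completion

# The harness strips the prompt by character count:
stripped = generation[len(prompt):]
print(repr(stripped))

# The slice removes "<s>" plus all but the last 3 characters of the
# prompt, so the prompt's trailing '"""' leaks into the generation --
# exactly the extra comment marker observed above.
```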