EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.
https://www.eleuther.ai
MIT License

wikitext weird results Mistral-7B-v0.1 length=4096 // Gemma-7B bos missing #1471

Open vince62s opened 4 months ago

vince62s commented 4 months ago

The first two look ok, the last one is weird, are you getting the same?

hf-auto (pretrained=mistralai/Mistral-7B-Instruct-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity| 9.8183|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.5329|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.6163|±  |N/A   |

hf-auto (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity| 8.7921|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.5016|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.5865|±  |N/A   |

hf-auto (pretrained=mistralai/Mistral-7B-v0.1), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity|17.9574|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.7161|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.7792|±  |N/A   |
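
(For reference, a minimal sketch of how runs like these can be reproduced from Python, assuming a recent 0.4.x harness where lm_eval.simple_evaluate is exposed; the CLI with --tasks wikitext is equivalent:)

    import lm_eval

    # Minimal sketch: the programmatic counterpart of the runs above,
    # assuming the 0.4.x harness's simple_evaluate entry point.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-v0.1",
        tasks=["wikitext"],
        batch_size=8,
    )
    print(results["results"]["wikitext"])
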
haileyschoelkopf commented 4 months ago

Can confirm I can replicate this. I don't know what the cause could be for Mistral to behave this way, but the result appears to be correct. Perhaps late in Mistral's training the data distribution is very different from wikitext? I don't know why the instruct model would then have so much lower perplexity, though.

vince62s commented 4 months ago
No, it's a long-context issue with Mistral. With max_length=512 I am getting:

hf (pretrained=mistralai/Mistral-7B-Instruct-v0.2,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity|13.4155|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.6251|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.7005|±  |N/A   |

hf (pretrained=mistralai/Mistral-7B-v0.1,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity|10.7703|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.5597|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.6412|±  |N/A   |
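
(For anyone reproducing this: the only change versus the runs above is capping the context window through model_args, e.g. this sketch, again assuming the simple_evaluate entry point:)

    import lm_eval

    # Sketch: same wikitext run, but with the HF model's context capped at 512,
    # which also caps the window size used by loglikelihood_rolling.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=mistralai/Mistral-7B-v0.1,max_length=512",
        tasks=["wikitext"],
        batch_size=1,
    )
    print(results["results"]["wikitext"])
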

cc: @ArthurZucker

vince62s commented 4 months ago

btw I am getting crazy ppl with Gemma-7b, see here: https://github.com/huggingface/transformers/issues/29250#issuecomment-1983639346

haileyschoelkopf commented 4 months ago

Assuming you're using the most up-to-date codebase version (after add_bos_token was introduced), maybe there is a problem with how the chunking in loglikelihood_rolling interacts with that option, in that chunks after the first don't have a BOS token and this messes Gemma up?
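
(To illustrate why a missing BOS would hit Gemma in particular; this snippet is only illustrative and is not harness code:)

    from transformers import AutoTokenizer

    # Gemma's tokenizer prepends <bos> (id 2) to every encoded input by default,
    # so the model effectively always sees BOS at position 0. Rolling perplexity
    # windows, however, are slices of one long pre-tokenized document, so only
    # the very first window can start with BOS.
    tok = AutoTokenizer.from_pretrained("google/gemma-7b")
    ids = tok("a wikitext paragraph", add_special_tokens=True).input_ids
    print(ids[0], tok.bos_token_id)  # both should be 2 (<bos>)
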

vince62s commented 4 months ago

There is definitely something related to this. When patching the rolling-window chunking like this (token id 2 is Gemma's BOS):

    # Special handling for first window: predict all tokens
    first_seq_len = min(max_seq_len, len(token_list))
    yield ([prefix_token] + [2] + token_list[: first_seq_len - 1], token_list[:first_seq_len])
    predicted += first_seq_len

    while predicted < len(token_list):
        window_pred_len = min(len(token_list) - predicted, pred_len)
        window_end = predicted + window_pred_len

        yield (
            [2] + token_list[window_end - max_seq_len - 1 : window_end - 2],
            [2] + token_list[window_end - window_pred_len : window_end - 1],
        )
        predicted += window_pred_len
I am getting:

hf (pretrained=google/gemma-7b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

|Tasks   |Version|Filter|n-shot|Metric         |  Value|   |Stderr|
|--------|------:|------|------|---------------|------:|---|------|
|wikitext|      2|none  |None  |word_perplexity|20.1880|±  |N/A   |
|        |       |none  |None  |byte_perplexity| 1.7541|±  |N/A   |
|        |       |none  |None  |bits_per_byte  | 0.8107|±  |N/A   |

Still far from Llama-7B / Mistral-7B at max_length=512, and it also still goes OOM with the default max_length. Note that I added token id 2 (Gemma's BOS) even for the first chunk, because it does not seem to be added otherwise.
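
(A cleaner variant of the same idea, sketched purely for illustration rather than merged code: take the model's BOS id as a parameter instead of hard-coding 2, and prepend it to every window's context while leaving the scored tokens untouched.)

    # Sketch only: a generic rolling-window generator that starts every window
    # with the model's BOS token. The name and signature are made up for
    # illustration; this is not the harness's own chunking helper.
    def rolling_windows_with_bos(token_list, bos_token_id, max_seq_len):
        # First window: BOS followed by the first first_seq_len-1 document
        # tokens, scoring the first first_seq_len document tokens.
        first_seq_len = min(max_seq_len, len(token_list))
        yield ([bos_token_id] + token_list[: first_seq_len - 1], token_list[:first_seq_len])
        predicted = first_seq_len

        # Later windows: reserve one context slot for BOS and keep the usual
        # alignment where the last window_pred_len tokens are the ones scored.
        while predicted < len(token_list):
            window_pred_len = min(len(token_list) - predicted, max_seq_len - 1)
            window_end = predicted + window_pred_len
            yield (
                [bos_token_id] + token_list[window_end - max_seq_len : window_end - 1],
                token_list[window_end - window_pred_len : window_end],
            )
            predicted += window_pred_len
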

haileyschoelkopf commented 3 months ago

#1588 is a relevant PR for this issue!

#1565 would be the ideal way to fix this issue cleanly for merge. I think we probably want to keep the default behavior as is, but we should allow users to specify --gen_kwargs for loglikelihood_rolling tasks (--request_kwargs?), and within that one could specify that each rolling token window should start with BOS.

Let me know if this sounds good to you! If you'd like to contribute any of this upstream, let me know; if not, I will try to get to this feature relatively soon!

l2002924700 commented 1 month ago

> The first two look ok, the last one is weird, are you getting the same?


Hello @vince62s, could you please share the URL of the wikitext dataset? I got weird test results with the wikitext dataset and want to check it. Thank you.