vince62s opened this issue 4 months ago
Can confirm I can replicate this. I don't know what could cause Mistral to behave this way, but the numbers appear to be correct. Perhaps late in Mistral's training the data distribution is very different from wikitext? I don't know why the instruct model would then have so much lower perplexity, though.
No no, it's a long-context issue with Mistral. With max_length 512 I am getting:

hf (pretrained=mistralai/Mistral-7B-Instruct-v0.2,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric          | Value   |   | Stderr |
|----------|---------|--------|--------|-----------------|---------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 13.4155 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.6251  | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.7005  | ± | N/A    |
hf (pretrained=mistralai/Mistral-7B-v0.1,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric          | Value   |   | Stderr |
|----------|---------|--------|--------|-----------------|---------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 10.7703 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.5597  | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.6412  | ± | N/A    |
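For anyone trying to reproduce these numbers, the same run can be done through the harness's Python API. This is just a sketch, assuming a recent lm-evaluation-harness (v0.4.x) where `simple_evaluate` is exported at the package level:

```python
# Sketch, assuming lm-evaluation-harness v0.4.x: reproduce the
# max_length=512 wikitext run above via the Python API.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1,max_length=512",
    tasks=["wikitext"],
    batch_size=1,
)
print(results["results"]["wikitext"])
```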
cc: @ArthurZucker
btw I am getting crazy ppl with Gemma-7b, see here: https://github.com/huggingface/transformers/issues/29250#issuecomment-1983639346
Assuming you're using the most up-to-date codebase version (after `add_bos_token` was introduced), maybe there is a problem with how the chunking in `loglikelihood_rolling` interacts with that option, in that chunks after the first don't have a BOS token and this messes Gemma up?
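For context, the relevant upstream logic looks roughly like this (paraphrased from the harness's `get_rolling_token_windows` in `lm_eval/utils.py`; check your checkout for the exact version). Note that only the first window receives `prefix_token`, which is the suspected interaction:

```python
# Paraphrase of the harness's rolling-window generator: only the FIRST
# window gets prefix_token (the BOS); later windows get raw context only.
def get_rolling_token_windows(token_list, prefix_token, max_seq_len, context_len):
    pred_len = max_seq_len - context_len + 1
    predicted = 0

    # Special handling for first window: predict all tokens.
    first_seq_len = min(max_seq_len, len(token_list))
    yield [prefix_token] + token_list[: first_seq_len - 1], token_list[:first_seq_len]
    predicted += first_seq_len

    # Later windows: pure token context, no BOS anywhere.
    while predicted < len(token_list):
        window_pred_len = min(len(token_list) - predicted, pred_len)
        window_end = predicted + window_pred_len
        yield (
            token_list[window_end - max_seq_len - 1 : window_end - 1],
            token_list[window_end - window_pred_len : window_end],
        )
        predicted += window_pred_len
```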
There is definitely something related to this. When patching like this (2 is hardcoded as Gemma's BOS token id; `pred_len` and `predicted` are defined earlier in the function, as upstream):

```python
# Special handling for first window: predict all tokens.
first_seq_len = min(max_seq_len, len(token_list))
# Hack: insert BOS (id 2 for Gemma) right after the prefix token.
yield ([prefix_token] + [2] + token_list[: first_seq_len - 1], token_list[:first_seq_len])
predicted += first_seq_len

while predicted < len(token_list):
    window_pred_len = min(len(token_list) - predicted, pred_len)
    window_end = predicted + window_pred_len
    # Hack: start every later window with BOS as well, shortening the
    # context/target slices by one token to make room for it.
    yield (
        [2] + token_list[window_end - max_seq_len - 1 : window_end - 2],
        [2] + token_list[window_end - window_pred_len : window_end - 1],
    )
    predicted += window_pred_len
```
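To eyeball what the patched generator yields, here is a toy run (assuming the snippet above replaces the body of `get_rolling_token_windows`, so the surrounding definitions are as upstream; the ids are dummies):

```python
# Toy sanity check of the patched generator: 0 stands in for the harness's
# prefix token, 2 is the hardcoded Gemma BOS id from the patch above.
tokens = list(range(100, 112))  # 12 dummy token ids
for inp, tgt in get_rolling_token_windows(tokens, prefix_token=0,
                                          max_seq_len=8, context_len=4):
    print(inp, "->", tgt)
# [0, 2, 100, ..., 106] -> [100, ..., 107]      (first window)
# [2, 103, ..., 109]    -> [2, 108, 109, 110]   (later windows now start with 2)
```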
I am getting:

hf (pretrained=google/gemma-7b,max_length=512), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric          | Value   |   | Stderr |
|----------|---------|--------|--------|-----------------|---------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 20.1880 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.7541  | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.8107  | ± | N/A    |
Still far from Llama-7b / Mistral-7B at max_length=512, and it still goes OOM with the default max_length. Note I added token id 2 (Gemma's BOS) even for the first chunk, because it did not seem to be added there otherwise.
Maybe we could add an equivalent of `--gen_kwargs` for loglikelihood_rolling tasks (`--request_kwargs`?), and within that one could specify that each rolling token window should start with BOS. LMK if this sounds good to you! If you'd like to contribute any of this upstream, lmk; if not, I will try to get to this feature relatively soon!
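Sketching what that could look like (entirely hypothetical: neither a `--request_kwargs` flag nor a `bos_each_window` parameter exists in the harness today; the function body paraphrases the upstream logic shown earlier):

```python
# Hypothetical sketch only: `bos_each_window` is an invented parameter showing
# how a --request_kwargs-style option could thread down to the window generator.
def get_rolling_token_windows(token_list, prefix_token, max_seq_len,
                              context_len, bos_each_window=False):
    pred_len = max_seq_len - context_len + 1

    # First window is unchanged: it already starts with the prefix/BOS token.
    first_seq_len = min(max_seq_len, len(token_list))
    yield [prefix_token] + token_list[: first_seq_len - 1], token_list[:first_seq_len]
    predicted = first_seq_len

    bos = [prefix_token] if bos_each_window else []
    while predicted < len(token_list):
        window_pred_len = min(len(token_list) - predicted, pred_len)
        window_end = predicted + window_pred_len
        # Reserve one context slot for BOS so inputs never exceed max_seq_len
        # (which may also be relevant to the OOM noted above).
        ctx_start = window_end - max_seq_len - 1 + len(bos)
        yield (
            bos + token_list[ctx_start : window_end - 1],
            token_list[window_end - window_pred_len : window_end],
        )
        predicted += window_pred_len
```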
The first two look ok, the last one is weird. Are you getting the same?
hf-auto (pretrained=mistralai/Mistral-7B-Instruct-v0.2), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

| Tasks    | Version | Filter | n-shot | Metric          | Value  |   | Stderr |
|----------|---------|--------|--------|-----------------|--------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 9.8183 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.5329 | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.6163 | ± | N/A    |

hf-auto (pretrained=meta-llama/Llama-2-7b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric          | Value  |   | Stderr |
|----------|---------|--------|--------|-----------------|--------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 8.7921 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.5016 | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.5865 | ± | N/A    |

hf-auto (pretrained=mistralai/Mistral-7B-v0.1), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

| Tasks    | Version | Filter | n-shot | Metric          | Value   |   | Stderr |
|----------|---------|--------|--------|-----------------|---------|---|--------|
| wikitext | 2       | none   | None   | word_perplexity | 17.9574 | ± | N/A    |
|          |         | none   | None   | byte_perplexity | 1.7161  | ± | N/A    |
|          |         | none   | None   | bits_per_byte   | 0.7792  | ± | N/A    |
Hello @vince62s, could you please share the URL of the wikitext dataset? I got weird test results with the wikitext dataset and want to check it. Thank you.