Question about LLama3-8B-80K performance on LongBench dataset.

ZetangForward commented 2 days ago

Hi, I noticed in the tech report of LLama3-8B-80K that, the authors evaluate the vanilla LLama-8K-Instruct in the LongBench dataset with 8K context length, and obtain the following results:

And the evaluation code here uses the 31500 context length: https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L40

Which is the best setting (context length) to evaluate the model on the LongBench dataset? 8K or around 32K? I also wonder Is it fair to compare LLama3-8B-Instruct, which can only view 8K context length, to LLama3-8B-80K, which can view 32K context length? Does LLama3-8B-80K perform the same as a vanilla 8K context model if it has a context length of only 8K?

Then, I also tried the released LLama3-8B-80K model https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged, and add \n as eos token (https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L111). However, the generated results of narrativeqa subset become:

, where model can only generate "assistant\n"

Did the authors find such a phenomenon?

namespace-Pt commented 2 days ago

Hi,

We haven't tested the vanilla Llama-3-8b-instruct with 32K context length because it is designed with 8K context length.
This abnormal prediction is due to the chat template. The chat template of Llama-3 is <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nINSTRUCTION<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n where INSTRUCTION is the instruction. You are getting this wrong result because the model renders the \n that should be in the chat template as an eos token. Given the correct chat template, the result should be like the following, and the score is 25.97 given 8K length.

ZetangForward commented 1 day ago

LLama3-8B-80K

Thanks for your response. Regarding question 1, have you tested the performance of the LLama3-8B-80K with just an 8K context length? I ask because I've observed that the vanilla LLama-3-8K can achieve a high score, and I'm curious if scaling the context might affect its performance.

namespace-Pt commented 1 day ago

The result I posted above (narrativeqa score=25.97) is the Llama-3-80k model with 8k context only. You can test more tasks if interested.

FlagOpen / FlagEmbedding

Question about LLama3-8B-80K performance on LongBench dataset. #931