Open ZetangForward opened 2 days ago
Hi,
<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nINSTRUCTION<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
where INSTRUCTION is replaced by the actual instruction. You are getting this wrong result because the \n that should be part of the chat template is missing from the prompt, so the model generates it itself and it is then treated as an eos token. Given the correct chat template, the result should be like the following, and the score is 25.97 at 8K context length.
LLama3-8B-80K
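To make the template issue above concrete, here is a minimal sketch in plain Python. The helper names `build_prompt` and `cut_at_newline` are hypothetical and are not taken from the evaluation script; `cut_at_newline` only imitates the effect of treating \n as an eos token by keeping the text before the first newline.

```python
# Single-turn Llama-3 chat template, as quoted above.
LLAMA3_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "{instruction}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def build_prompt(instruction: str) -> str:
    # Hypothetical helper: wrap the instruction in the template,
    # including the trailing blank line after the assistant header.
    return LLAMA3_TEMPLATE.format(instruction=instruction)


def cut_at_newline(generated: str) -> str:
    # Assumption: imitates a newline stop token by keeping only
    # the text before the first newline in the generated string.
    return generated.split("\n", 1)[0]


prompt = build_prompt("Answer the question based on the story.")

# If the prompt already ends with the blank line, the model can start
# answering directly and the newline stop only trims trailing text.
good = cut_at_newline("The answer\nsome extra text")

# If the blank line is missing from the prompt, the model tends to emit
# it first, and the newline stop then leaves an empty answer.
bad = cut_at_newline("\n\nThe answer")
```

In this sketch `good` keeps the answer while `bad` is empty, which matches the failure mode described above: the prompt must end with the template's own newlines before \n is used as a stop token.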
Thanks for your response. Regarding question 1, have you tested the performance of the LLama3-8B-80K with just an 8K context length? I ask because I've observed that the vanilla LLama-3-8K can achieve a high score, and I'm curious if scaling the context might affect its performance.
The result I posted above (narrativeqa score=25.97) is the Llama-3-80k model with 8k context only. You can test more tasks if interested.
Hi, I noticed that in the tech report of LLama3-8B-80K, the authors evaluate the vanilla LLama-8K-Instruct on the LongBench dataset with an 8K context length and obtain the following results: ![image](https://github.com/FlagOpen/FlagEmbedding/assets/123983104/257e8c71-6465-43eb-b9a6-299598a08b86)
However, the evaluation code here uses a context length of 31500: https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L40
Which setting (context length) is best for evaluating the model on the LongBench dataset: 8K or around 32K? I also wonder whether it is fair to compare LLama3-8B-Instruct, which can only see an 8K context, with LLama3-8B-80K, which can see a 32K context. Does LLama3-8B-80K perform the same as a vanilla 8K context model when it is limited to an 8K context length?
Then, I also tried the released LLama3-8B-80K model https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged, and added \n as an eos token (https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L111). However, the generated results on the narrativeqa subset become:
Did the authors observe such a phenomenon?