FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Question about LLama3-8B-80K performance on LongBench dataset. #931

Open ZetangForward opened 2 days ago

ZetangForward commented 2 days ago

Hi, I noticed in the tech report of Llama3-8B-80K that the authors evaluate the vanilla Llama-3-8B-Instruct on the LongBench dataset with an 8K context length and obtain the following results: [screenshot of the LongBench results table from the tech report]

However, the evaluation code here uses a 31500-token context length: https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L40

Which setting (context length) is best for evaluating the model on the LongBench dataset: 8K or around 32K? I also wonder whether it is fair to compare Llama3-8B-Instruct, which can only see an 8K context, with Llama3-8B-80K, which can see a 32K context. Does Llama3-8B-80K perform the same as a vanilla 8K-context model when it is restricted to an 8K context?
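For context, my understanding is that LongBench-style evaluation truncates over-long inputs from the middle to fit a token budget, so the 8K and ~32K settings differ only in that budget. A minimal sketch of what I mean (hypothetical helper, not the repo's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged")

def truncate_middle(prompt: str, max_length: int) -> str:
    """Keep the head and tail of an over-long prompt and drop the middle."""
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    if len(ids) <= max_length:
        return prompt
    half = max_length // 2
    return tokenizer.decode(ids[:half] + ids[-half:], skip_special_tokens=True)

long_prompt = "<LongBench document and question here> " * 20000  # placeholder input
prompt_8k = truncate_middle(long_prompt, max_length=8192)    # vanilla 8K-Instruct budget
prompt_32k = truncate_middle(long_prompt, max_length=31500)  # budget used in eval_longbench.py
```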

Then, I also tried the released Llama3-8B-80K model https://huggingface.co/namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged and added \n as an EOS token (https://github.com/FlagOpen/FlagEmbedding/blob/681f61562269ca55cbfa756d0f91027cb809e73f/Long_LLM/longllm_qlora/main/eval_longbench.py#L111). However, the generated results on the narrativeqa subset become:

[screenshot of generation outputs], where the model only generates "assistant\n".
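For reference, this is roughly my generation setup (a sketch assuming the standard transformers API; the prompt is a placeholder and the newline-EOS handling mirrors eval_longbench.py#L111):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Placeholder prompt; in the real run this is the truncated LongBench prompt.
prompt = "Answer the question based on the story above: ..."
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop on the model's own EOS *and* on a bare newline (the extra "\n" EOS).
# One possible cause of the near-empty outputs: if the model emits "\n" right
# away, decoding ends almost immediately, leaving only "assistant\n".
newline_id = tokenizer.encode("\n", add_special_tokens=False)[-1]
output = model.generate(
    input_ids,
    max_new_tokens=128,
    eos_token_id=[tokenizer.eos_token_id, newline_id],
)
print(tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True))
```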

Did the authors find such a phenomenon?

namespace-Pt commented 2 days ago

Hi,

ZetangForward commented 1 day ago

LLama3-8B-80K

Thanks for your response. Regarding question 1, have you tested the performance of Llama3-8B-80K with just an 8K context length? I ask because I've observed that the vanilla Llama-3-8B-Instruct can achieve a high score, and I'm curious whether extending the context might affect its performance at shorter lengths.

namespace-Pt commented 1 day ago

The result I posted above (narrativeqa score = 25.97) is from the Llama-3-80K model with an 8K context only. You can test more tasks if you are interested.
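If you want to reproduce the 8K-context run, here is a rough sketch (simplified prompt building and truncation; the actual eval_longbench.py handles the per-task templates and scoring):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "namespace-Pt/Llama-3-8B-Instruct-80K-QLoRA-Merged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
data = load_dataset("THUDM/LongBench", "narrativeqa", split="test", trust_remote_code=True)

for example in data:
    # Simplified prompt; the eval script uses LongBench's per-task templates.
    prompt = example["context"] + "\n\n" + example["input"]
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    if len(ids) > 8192:  # cap the 80K model at the same 8K window as vanilla Instruct
        half = 8192 // 2
        ids = ids[:half] + ids[-half:]
    prompt_8k = tokenizer.decode(ids, skip_special_tokens=True)
    # ... generate with prompt_8k and score against example["answers"] ...
```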