Hi Authors,

Thanks for the great work!

I tried to evaluate LongLoRA on LongBench (https://github.com/THUDM/LongBench) using the LongAlpaca-7B checkpoint (https://huggingface.co/Yukang/LongAlpaca-7B). I load the model directly in the LongBench evaluation harness following the same procedure as in your repository, set the test length in LongBench to 31500, and use LongBench's default prompt templates (since the model has a 32K context length). My results differ noticeably from the reported numbers (a sketch of my loading setup is included after the table):
| Name | avg. | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Synthetic | Code |
|------|------|---------------|--------------|---------------|----------|-----------|------|
| Report | 36.8 | 28.7 | 28.1 | 27.8 | 63.9 | 16.7 | 56.0 |
| Our Reprd. | 22.7 | 14.8 | 9.5 | 24.5 | 41.9 | 4.9 | 40.8 |
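For reference, this is roughly how I load the checkpoint and truncate prompts before feeding them to LongBench. It is my own configuration rather than code taken from either repository; the `rope_scaling` factor of 8 (32768 / 4096) and the head-and-tail truncation to 31500 tokens are my assumptions about how the 32K context should be handled.

```python
# Minimal sketch of my loading/truncation setup (my own assumptions, not reference code).
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "Yukang/LongAlpaca-7B"
MAX_LENGTH = 31500  # test length I set in LongBench

config = AutoConfig.from_pretrained(MODEL_NAME)
if getattr(config, "rope_scaling", None) is None:
    # Assumed linear RoPE scaling to stretch the 4096 base context to 32K,
    # in case the checkpoint config does not already carry this entry.
    config.rope_scaling = {"type": "linear", "factor": 8.0}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",
).eval()

def truncate_middle(prompt: str) -> str:
    # Keep the head and tail of over-long prompts (my reading of LongBench's truncation).
    ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
    if len(ids) <= MAX_LENGTH:
        return prompt
    half = MAX_LENGTH // 2
    return (
        tokenizer.decode(ids[:half], skip_special_tokens=True)
        + tokenizer.decode(ids[-half:], skip_special_tokens=True)
    )
```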
I suspect my preprocessing code (data loading, filtering, metric calculation, etc.) or instruction construction may differ from yours, but I cannot find any reference code for this configuration. Could you please share a copy of your LongBench evaluation code? My personal email is henryhan88888@gmail.com.
Thanks a lot!