Hi Authors,

Thanks for the great work!

I tried to evaluate LongLoRA on LongBench (https://github.com/THUDM/LongBench) using the LongAlpaca-7B checkpoint (https://huggingface.co/Yukang/LongAlpaca-7B). I load the model directly in the LongBench evaluation harness following the same procedure as in your repository, set the test length in LongBench to 31500, and use LongBench's default prompt templates (since the model has a 32K context length). My results differ noticeably from the reported numbers (a sketch of my loading setup is included after the table):
| Name | avg. | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Synthetic | Code |
|------|------|---------------|--------------|---------------|----------|-----------|------|
| Report | 36.8 | 28.7 | 28.1 | 27.8 | 63.9 | 16.7 | 56.0 |
| Our Reprd. | 22.7 | 14.8 | 9.5 | 24.5 | 41.9 | 4.9 | 40.8 |
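For reference, this is roughly how I load the checkpoint and truncate prompts before feeding them to LongBench. It is my own configuration rather than code taken from either repository; the `rope_scaling` factor of 8 (32768 / 4096) and the head-and-tail truncation to 31500 tokens are my assumptions about how the 32K context should be handled.

```python
# Minimal sketch of my loading/truncation setup (my own assumptions, not reference code).
import torch
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "Yukang/LongAlpaca-7B"
MAX_LENGTH = 31500  # test length I set in LongBench

config = AutoConfig.from_pretrained(MODEL_NAME)
if getattr(config, "rope_scaling", None) is None:
    # Assumed linear RoPE scaling to stretch the 4096 base context to 32K,
    # in case the checkpoint config does not already carry this entry.
    config.rope_scaling = {"type": "linear", "factor": 8.0}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",
).eval()

def truncate_middle(prompt: str) -> str:
    # Keep the head and tail of over-long prompts (my reading of LongBench's truncation).
    ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids[0]
    if len(ids) <= MAX_LENGTH:
        return prompt
    half = MAX_LENGTH // 2
    return (
        tokenizer.decode(ids[:half], skip_special_tokens=True)
        + tokenizer.decode(ids[-half:], skip_special_tokens=True)
    )
```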
I suspect my preprocessing code (data loading, filtering, metric calculation, etc.) or instruction construction may differ from yours, but I cannot find any reference code for this configuration. Could you please share a copy of your LongBench evaluation code? My personal email is henryhan88888@gmail.com.
Thanks a lot!