dvlab-research / LongLoRA

Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)
http://arxiv.org/abs/2309.12307
Apache License 2.0

LongBench evaluation #194

Open Clement25 opened 2 months ago

Clement25 commented 2 months ago

Hi Authors,

Thanks for the great work!

I tried to evaluate LongLoRA on LongBench (https://github.com/THUDM/LongBench) using the LongAlpaca-7B checkpoint (https://huggingface.co/Yukang/LongAlpaca-7B). I loaded the model directly in the LongBench evaluation code, following the same loading procedure as in your repository. I set the test length in LongBench to 31500 (since the model has a 32K context length) and used LongBench's default prompt template. My results differ noticeably from the reported ones:
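One place where two setups can silently diverge is how over-long prompts are cut down to the test length. As far as I understand, LongBench's prediction script keeps the head and tail of the tokenized prompt and drops the middle. Below is a minimal sketch of that idea on a plain list of token IDs. The function name `truncate_middle` is hypothetical and the code is not copied from either repository.

```python
def truncate_middle(token_ids, max_length):
    """Keep the first and last max_length // 2 tokens and drop the middle.

    Assumption: this mirrors middle truncation as I understand it from
    LongBench; it is an illustration, not the benchmark's actual code.
    """
    if max_length < 2:
        raise ValueError("max_length must be at least 2")
    if len(token_ids) <= max_length:
        return list(token_ids)
    half = max_length // 2
    return list(token_ids[:half]) + list(token_ids[-half:])


# Example: a 40000-token prompt cut down to the 31500-token test length.
tokens = list(range(40000))
kept = truncate_middle(tokens, 31500)
print(len(kept), kept[0], kept[-1])
```

If one setup truncates from the end instead, the question or instruction that usually sits at the end of a LongBench prompt could be cut off, which might explain part of a gap like the one I see.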

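| Name             | Avg. | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot | Synthetic | Code |
|------------------|------|---------------|--------------|---------------|----------|-----------|------|
| Reported         | 36.8 | 28.7          | 28.1         | 27.8          | 63.9     | 16.7      | 56.0 |
| Our reproduction | 22.7 | 14.8          | 9.5          | 24.5          | 41.9     | 4.9       | 40.8 |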
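To make the gap easier to read, here is a small sketch that recomputes the average and the per-category difference from the numbers above. It assumes the average is an unweighted mean over the six categories, which may not be exactly how the reported average was computed.

```python
categories = ["Single-Doc QA", "Multi-Doc QA", "Summarization",
              "Few-shot", "Synthetic", "Code"]
reported = [28.7, 28.1, 27.8, 63.9, 16.7, 56.0]
reproduced = [14.8, 9.5, 24.5, 41.9, 4.9, 40.8]


def macro_avg(scores):
    """Unweighted mean over categories (an assumption about the 'avg.' column)."""
    return sum(scores) / len(scores)


# Gap per category, in points: reported minus reproduced.
gaps = {name: round(r - p, 1)
        for name, r, p in zip(categories, reported, reproduced)}

for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {gap:5.1f}")
print(f"macro avg: reported {macro_avg(reported):.2f}, "
      f"reproduced {macro_avg(reproduced):.2f}")
```

The largest drops are in Few-shot and Multi-Doc QA, while Summarization is close to the reported score.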

I suspect that my preprocessing code (data loading, filtering, metric calculation, etc.) or my instruction construction differs from yours, but I cannot find any reference code for this configuration. Could you please share a copy of your LongBench evaluation code? My personal email is henryhan88888@gmail.com.
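To illustrate why metric calculation and instruction construction matter here: as far as I understand, LongBench scores its English QA tasks with a token-level F1 after normalizing the answer. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and articles). The names `normalize_answer` and `token_f1` are hypothetical and the code is not taken from either repository.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-level F1 between a prediction and a single reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Paris", "Paris"))
print(token_f1("The answer to the question is Paris, France.", "Paris"))
```

With a metric like this, a long or chatty answer lowers precision even when it contains the correct span, so a different prompt template or missing post-processing of the output could change the score considerably.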

Thanks a lot!