THUDM / LongBench

[ACL 2024] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
MIT License

`Llama2-7B-chat-4k` on `PassageRetrieval-zh` gets `10.12` #61

Open · fuqichen1998 opened this issue 6 months ago

fuqichen1998 commented 6 months ago

As in the title: my evaluation of `Llama2-7B-chat-4k` on `PassageRetrieval-zh` gets 10.12, which is significantly higher than the score reported in the README (0.5). Could you please share why?

bys0318 commented 6 months ago

Hi! Are you using the prompt template as in config/dataset2prompt.json?
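For reference, the templates in `config/dataset2prompt.json` map each dataset name to a format string that is filled with fields from the sample. A minimal sketch of that flow (the template string below is illustrative, not the actual `PassageRetrieval-zh` template):

```python
# Illustrative stand-in for config/dataset2prompt.json: each dataset name
# maps to a format string with placeholders like {context} and {input}.
dataset2prompt = {
    "example-dataset": "Context: {context}\n\nQuestion: {input}\nAnswer:"
}

def build_prompt(dataset: str, sample: dict) -> str:
    # Fill the dataset-specific template with fields from the sample.
    template = dataset2prompt[dataset]
    return template.format(**sample)

prompt = build_prompt("example-dataset", {"context": "...", "input": "..."})
```

If a run bypasses these templates, the model sees a different task framing and scores can diverge sharply from the README.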

bys0318 commented 6 months ago

We refer to our code here for the llama2 prompt: https://github.com/THUDM/LongBench/blob/main/pred.py#L33
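The llama2 handling at that line wraps the filled prompt in the chat instruction markers. Roughly, paraphrased from `pred.py` rather than copied verbatim:

```python
def build_chat(prompt: str, model_name: str) -> str:
    # For llama2-chat models, wrap the raw prompt in the [INST] ... [/INST]
    # markers the chat-tuned model was trained to expect; other models
    # receive the prompt unchanged in this sketch.
    if "llama2" in model_name:
        prompt = f"[INST]{prompt}[/INST]"
    return prompt
```

Skipping this wrapper changes how the chat model interprets the input, which is one common source of score mismatches.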

fuqichen1998 commented 6 months ago

Yes, I was using your pred.py to run the inference and evaluation.

slatter666 commented 5 months ago

> Yes, I was using your pred.py to run the inference and evaluation.

Actually, I also get the same result.

condy0919 commented 1 month ago

> We refer to our code here for the llama2 prompt: https://github.com/THUDM/LongBench/blob/main/pred.py#L33

Is the `[INST]` wrapper necessary for llama2-7b/llama2-13b?