OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
GNU General Public License v3.0

Question about the leaderboard #16

Closed cizhenshi closed 3 months ago

cizhenshi commented 3 months ago

In the leaderboard's GPT-4-evaluated section, why isn't the sum of n_wins and n_draws the same for every row? What evaluation method is used for the leaderboard? Is it the 181-question set?

cizhenshi commented 3 months ago

> In comparing various models to Turbo-16k-0613 on open-ended tasks, we evaluate these models on the 96-question subset using GPT-4 and two subsets (85 + 96 questions) using GPT-3.5. We reduce the positional biases by swapping paired predictions, so the GPT-4 evaluator is used in 96×2 evaluation rounds, while the GPT-3.5 evaluator is used in 181×2 rounds.
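
A minimal sketch of the swapped-position judging described in that passage; the `judge` interface, verdict labels, and dummy judge are assumptions for illustration, not the repo's actual evaluation script:

```python
from typing import Callable

# Verdict returned by the judge for one round: "A", "B", or "TIE".
Judge = Callable[[str, str, str, str], str]

def judge_pair(judge: Judge, question: str, reference: str,
               model_answer: str, turbo_answer: str):
    """Run two judging rounds per question, swapping which answer is shown
    first, to reduce positional bias. Returns (wins, draws, losses) counted
    from the candidate model's point of view against Turbo-16k-0613."""
    wins = draws = losses = 0
    for answer_a, answer_b, model_is_a in [
        (model_answer, turbo_answer, True),   # round 1: candidate shown as "A"
        (turbo_answer, model_answer, False),  # round 2: positions swapped
    ]:
        verdict = judge(question, reference, answer_a, answer_b)
        if verdict == "TIE":
            draws += 1
        elif (verdict == "A") == model_is_a:
            wins += 1
        else:
            losses += 1
    return wins, draws, losses

# Dummy judge for demonstration; the real judge prompts GPT-4 (96 questions,
# 96*2 rounds) or GPT-3.5 (85+96 questions, 181*2 rounds) with both answers.
def dummy_judge(question, reference, answer_a, answer_b):
    return "TIE"

print(judge_pair(dummy_judge, "Q", "ref", "candidate answer", "turbo answer"))
# -> (0, 2, 0)
```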

ChenxinAn-fdu commented 3 months ago

Hi! We did not list n_loss in this table (n_loss + n_wins + n_draws = total). For the GPT-4 eval we use the 96-question subset, although having more samples would generally reduce the variance of the evaluation.
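
For example, a row's n_loss can be recovered from the listed numbers, assuming the counts are over the 96×2 swapped-position rounds (the numbers below are placeholders, not actual leaderboard rows):

```python
# Each model is judged in 96 * 2 = 192 rounds in the GPT-4-evaluated section.
N_ROUNDS = 96 * 2

def n_loss(n_wins: int, n_draws: int, total: int = N_ROUNDS) -> int:
    """Unlisted losses: n_loss + n_wins + n_draws = total."""
    return total - n_wins - n_draws

print(n_loss(n_wins=80, n_draws=60))  # -> 52 losses out of 192 rounds
```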

ChenxinAn-fdu commented 3 months ago

The GPT-4 API was much more expensive when we did this work; that's why we only evaluated 96 samples.