[Closed] cizhenshi closed this issue 3 months ago
Quoting the paper: "In comparing various models to Turbo-16k-0613 on open-ended tasks, we evaluate these models on the 96-question subset using GPT-4 and on two subsets (85+96 questions) using GPT-3.5. We reduce positional bias by swapping paired predictions, so the GPT-4 evaluator is used in 96×2 evaluation rounds, while the GPT-3.5 evaluator is used in 181×2 rounds."
Hi! We did not list n_loss in this table (n_loss + n_wins + n_draws = total). For the GPT-4 eval we use 96 questions, but having more samples generally reduces the variance of the evaluation.
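The accounting above can be sketched as follows. This is a minimal, hypothetical illustration of the swapped-pair evaluation scheme the paper describes: each question is judged twice with the answer order swapped to reduce positional bias, so wins + draws + losses always sums to 2× the number of questions. The `judge` function here is a toy stand-in for the GPT-4/GPT-3.5 evaluator call, not the actual implementation.

```python
def judge(answer_a, answer_b):
    # Toy stand-in for an LLM judge: prefers the longer answer.
    # In the real setup this would be a GPT-4 or GPT-3.5 API call.
    if len(answer_a) > len(answer_b):
        return "win"
    if len(answer_a) == len(answer_b):
        return "draw"
    return "loss"

def evaluate(pairs):
    """Judge each (model, baseline) answer pair in both orders.

    Returns (n_wins, n_draws, n_losses) from the model's perspective;
    the three counts sum to 2 * len(pairs), matching the 96x2 / 181x2
    evaluation rounds described above.
    """
    wins = draws = losses = 0
    flip = {"win": "loss", "loss": "win", "draw": "draw"}
    for model_ans, baseline_ans in pairs:
        first = judge(model_ans, baseline_ans)   # model shown first
        second = flip[judge(baseline_ans, model_ans)]  # order swapped
        for verdict in (first, second):
            if verdict == "win":
                wins += 1
            elif verdict == "draw":
                draws += 1
            else:
                losses += 1
    return wins, draws, losses
```

With 96 questions this yields 96×2 = 192 verdicts per model, and n_loss can always be recovered as total − n_wins − n_draws.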
The GPT-4 API was much more expensive when we did this work; that's why we only evaluated 96 samples.
Original question: In the leaderboard, for the GPT-4-evaluated section, why is the sum of n_wins and n_draws not the same for each row? What evaluation method is used in the leaderboard? Is it 181 questions?