THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

minimum r in “Evaluation Prompt Setup”? #66

Open DryPilgrim opened 8 months ago

DryPilgrim commented 8 months ago

We select the minimum r such that count of all tokens in (u0, ar, ur+1, · · · , uk) is not greater than 3500.


1. Why 3500 and not some other number?
2. If you want to limit the length of (u0, ar, ur+1, · · · , uk), you should limit both r and k; here only r is limited.
Longin-Yu commented 8 months ago

When we initially ran the evaluation, we found that the context length limit of almost all large models was around 4096 tokens. Although some open-source models have a limit of 2048, a limit that short makes agent tasks infeasible. To test these models fairly under the same standard while still accommodating them, we decided to keep the length within 4096.

However, because different models use different tokenizers, and some models produce more tokens than words after tokenization, we designed this scheme to satisfy almost all models with a 4096-token context length limit.
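The quoted rule can be sketched in a few lines. This is a hypothetical illustration, not the repository's actual implementation: `count_tokens` is a crude whitespace stand-in for a model-specific tokenizer, and `select_min_r` finds the smallest r such that u0 plus the history suffix starting at r fits within the budget (keeping as much recent history as possible).

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy; the paper counts real tokenizer tokens,
    # which is why the budget is 3500 rather than a full 4096.
    return len(text.split())

def select_min_r(u0: str, history: list[str], budget: int = 3500) -> int:
    """history holds the alternating agent/user messages after u0.

    Return the smallest index r such that u0 plus history[r:] stays
    within `budget` tokens. Scanning from the newest message backward
    keeps the most recent context; if even the last message does not
    fit, r == len(history) and only u0 is kept.
    """
    total = count_tokens(u0)
    r = len(history)
    for i in range(len(history) - 1, -1, -1):
        t = count_tokens(history[i])
        if total + t > budget:
            break  # adding an older message would exceed the budget
        total += t
        r = i
    return r
```

Note that k (the number of turns so far) is fixed by the dialogue itself, which is why only r is chosen: the newest turns are always kept, and r just determines how much older history is dropped.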