THUDM / AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
https://llmbench.ai
Apache License 2.0

minimum r in “Evaluation Prompt Setup”? #66

Open DryPilgrim opened 8 months ago

DryPilgrim commented 8 months ago

We select the minimum r such that count of all tokens in (u0, ar, ur+1, · · · , uk) is not greater than 3500.


1. Why 3500 and not some other number?
2. If you want to limit the length of (u0, ar, ur+1, · · · , uk), you should limit both r and k; here only r is limited.
Longin-Yu commented 8 months ago

When we initially ran the evaluation, we found that the context length limit of almost all large models was around 4096 tokens. Although some open-source models have a limit of 2048, a limit that short makes agent tasks infeasible. To test these models fairly under the same standard while still accommodating them, we decided to keep the length within 4096.

However, because different models use different tokenizers, and some models produce more tokens than words after tokenization, we designed this scheme to satisfy almost all models with a 4096-token context length limit.
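The quoted rule can be sketched in a few lines. This is a hypothetical illustration, not the repository's actual implementation: `count_tokens` is a crude whitespace stand-in for a model-specific tokenizer, and `select_min_r` finds the smallest r such that u0 plus the history suffix starting at r fits within the budget (keeping as much recent history as possible).

```python
def count_tokens(text: str) -> int:
    # Crude whitespace proxy; the paper counts real tokenizer tokens,
    # which is why the budget is 3500 rather than a full 4096.
    return len(text.split())

def select_min_r(u0: str, history: list[str], budget: int = 3500) -> int:
    """history holds the alternating agent/user messages after u0.

    Return the smallest index r such that u0 plus history[r:] stays
    within `budget` tokens. Scanning from the newest message backward
    keeps the most recent context; if even the last message does not
    fit, r == len(history) and only u0 is kept.
    """
    total = count_tokens(u0)
    r = len(history)
    for i in range(len(history) - 1, -1, -1):
        t = count_tokens(history[i])
        if total + t > budget:
            break  # adding an older message would exceed the budget
        total += t
        r = i
    return r
```

Note that k (the number of turns so far) is fixed by the dialogue itself, which is why only r is chosen: the newest turns are always kept, and r just determines how much older history is dropped.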