Will you release the benchmark dataset samples, evaluation metrics and methods?

CLUEbenchmark / SuperCLUE-Agent

SuperCLUE-Agent: A benchmark for evaluating the core capabilities of AI agents, based on native Chinese tasks


Open SilasTHU opened 5 months ago

SilasTHU commented 5 months ago

Right now we can only see the scores of these models, but I'm very interested in how you evaluate these agents.