Closed QingChengLineOne closed 11 months ago
Have you tried it? How does agentlm7b perform when tested on agent-bench, for example on HH?
Not yet. At the moment I'm confused about how to do the TRAJECTORY FILTERING.
The metrics are the same as those used in AgentBench. Details below:

| Task | Description | Example | Reward | Reward Calculation |
|---|---|---|---|---|
| ALFWorld | Daily Household Routines | Heat food | Success Rate | If task is finished, r=1, otherwise r=0 |
| WebShop | Online Shopping | Buy a shirt | Reward | Score for selecting the correct item during shopping |
| Mind2Web | Website Navigation | Book a ticket | Step Success Rate | Evaluate the predicted action correctness compared to reference actions. |
| KG | Retrieve Entity from KG | Which team won the 2014 AFC Championship Game? | F1 | Compare the model's predicted answers to the gold standard answers |
| DB | Database Operations | How many games did the badgers play in october? | Step Success | If MySQL query is correct, r=1, otherwise r=0 |
| OS | Interacting with OS | Count specific files | Step Success | If result from operating system is correct, r=1, otherwise r=0 |
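
In case it helps, here is a minimal sketch of how filtering on these per-task rewards might look. It is not the repository's actual code: `Trajectory`, `f1_score`, `filter_trajectories`, and the cutoff values in `THRESHOLDS` are all hypothetical names and placeholders chosen for illustration, and the F1 shown is a plain set-based F1 over predicted and gold answers.

```python
from dataclasses import dataclass


def f1_score(predicted, gold):
    """Set-based F1 between predicted answers and gold answers (assumed form)."""
    pred, ref = set(predicted), set(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


@dataclass
class Trajectory:
    task: str      # e.g. "ALFWorld", "KG", "OS" (hypothetical field)
    reward: float  # reward computed with that task's own metric


# Placeholder cutoffs, NOT values confirmed in this thread: adjust per task.
THRESHOLDS = {
    "ALFWorld": 1.0,
    "WebShop": 1.0,
    "Mind2Web": 1.0,
    "KG": 1.0,
    "DB": 1.0,
    "OS": 1.0,
}


def filter_trajectories(trajectories, thresholds, default=1.0):
    """Keep trajectories whose reward reaches the cutoff for their task."""
    return [
        t for t in trajectories
        if t.reward >= thresholds.get(t.task, default)
    ]


if __name__ == "__main__":
    data = [
        Trajectory("OS", 1.0),
        Trajectory("OS", 0.0),
        Trajectory("KG", f1_score(["a", "b"], ["b", "c"])),
    ]
    for t in filter_trajectories(data, THRESHOLDS):
        print(t)
```

Each task keeps its own metric when the reward is computed, so the filter itself only has to compare one number against a per-task cutoff.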
Are all 6 datasets filtered using reward as the metric? Or is it like in agentbench, where OS uses SR, KG uses F1, and DCG uses reward?