WooooDyy / AgentGym

Code and implementations for the paper "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments" by Zhiheng Xi et al.
https://arxiv.org/abs/2406.04151
MIT License
320 stars 38 forks

Inconsistent number of instructions for sciworld_test.json on HF dataset #27

Open xingjianleng opened 2 months ago

xingjianleng commented 2 months ago

Dear authors,

Thanks for your great work!

I'm trying to reproduce the evaluation results as shown in the paper. However, I just noticed a difference in the number of instructions between the paper and the code.

Table 2 of the paper lists 200 evaluation instructions for the Sciworld environment, but sciworld_test.json in the AgentEval HF dataset contains 1042 samples. Also, the conversation contents of each sample should be [] rather than the full trajectories.

Could you please update the sciworld_test.json file on HF datasets to the correct version, which should contain 200 samples and no conversation content?

Thanks in advance.

zouyingcao commented 1 month ago

Same question here. The Bird dataset has the same inconsistency (the paper claims 200 samples vs. 1534 in the HF dataset).

Jerry-hyl commented 2 weeks ago

Same question. The sciworld_test.json file is even in the format of the training set. Could you please update it to the correct version?

Yiwen-Ding commented 2 weeks ago

Hi,

Apologies for the delayed response, and thank you for pointing out the mistake. The dataset we originally uploaded was incorrect. The correct test set has been constructed by randomly sampling a subset of the original test set, as described in the paper. We've now updated the dataset with the correct version here: https://huggingface.co/datasets/AgentGym/AgentEval. Additionally, we’ve removed the unnecessary conversation field from the test set.
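
For anyone who wants to confirm that a downloaded copy matches the description above, here is a minimal sketch of a sanity check. It assumes the test file is a JSON array of objects and that the conversation field is stored under the key `conversations`; both are assumptions not confirmed in this thread, and the helper names `load_records` and `check_test_file` are hypothetical. The expected count of 200 comes from Table 2 of the paper as cited above.

```python
import json


def load_records(path):
    """Load a test file, assuming it is a JSON array of sample objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_test_file(records, expected_count=200, conversation_key="conversations"):
    """Return a list of problems found; an empty list means the file looks right.

    `conversation_key` is an assumed field name; adjust it to whatever
    the actual file uses.
    """
    problems = []

    # The paper reports a fixed number of evaluation instructions per environment.
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} samples, found {len(records)}")

    # A test sample should either lack the conversation field or have it empty.
    non_empty = [i for i, record in enumerate(records) if record.get(conversation_key)]
    if non_empty:
        problems.append(
            f"{len(non_empty)} samples still carry {conversation_key!r} content"
        )

    return problems
```

Calling `check_test_file(load_records("sciworld_test.json"))` on a local copy should return an empty list if the file has 200 samples and no conversation content. For other environments such as Bird, pass the count reported in the paper as `expected_count`.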