Open xingjianleng opened 2 months ago
Dear authors,
Thanks for your great work!
I'm trying to reproduce the evaluation results as shown in the paper. However, I just noticed a difference in the number of instructions between the paper and the code.
Table 2 of the paper says there are 200 evaluation instructions for the Sciworld environment, but sciworld_test.json in the AgentEval HF dataset contains 1042 samples. Also, the conversation field of each sample should be empty ([]) rather than holding the full trajectories.
Could you please update the sciworld_test.json file on HF datasets to the correct version, which should contain 200 samples and no conversation content? Thanks in advance.
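For anyone who wants to check a local copy of the file, here is a minimal sketch. It assumes the file is a JSON list of objects and that the trajectory lives under a key named `conversations`; that key name is a guess, so adjust `conv_key` to match the actual file.

```python
import json


def summarize_samples(samples, conv_key="conversations"):
    """Return (total sample count, count of samples with non-empty conversation content).

    `conv_key` is an assumed field name, not confirmed by the dataset card.
    """
    non_empty = sum(1 for sample in samples if sample.get(conv_key))
    return len(samples), non_empty


def summarize_file(path, conv_key="conversations"):
    """Load a JSON list from `path` and summarize it."""
    with open(path, encoding="utf-8") as f:
        samples = json.load(f)
    return summarize_samples(samples, conv_key)
```

Calling `summarize_file("sciworld_test.json")` on a downloaded copy returns the two counts, which can be compared against the 200 samples listed in Table 2.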
Same question here. The Bird dataset shows the same inconsistency (the paper claims 200 samples vs. 1534 on HF).
Same question. The sciworld_test.json file is even in the training-set format. Could you please update it to the correct version?
Hi,
Apologies for the delayed response, and thank you for pointing out the mistake. The dataset we originally uploaded was incorrect. The correct test set has been constructed by randomly sampling a subset of the original test set, as described in the paper. We've now updated the dataset with the correct version here: https://huggingface.co/datasets/AgentGym/AgentEval. Additionally, we’ve removed the unnecessary conversation field from the test set.
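For readers who only have the older, larger file, the subsampling described above could be approximated as follows. This is only a sketch: the seed, the sampling procedure the authors actually used, and the `conversations` key name are all assumptions, so the result will not necessarily match the official 200-sample split.

```python
import random


def build_eval_subset(samples, k=200, seed=0, conv_key="conversations"):
    """Randomly pick `k` samples and drop the conversation field from each.

    `seed` and `conv_key` are hypothetical; the thread does not state them.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    subset = rng.sample(samples, k)  # sampling without replacement
    return [
        {key: value for key, value in sample.items() if key != conv_key}
        for sample in subset
    ]
```

To reproduce the paper's numbers, the updated file on the Hub is the one to use; a locally resampled subset is only useful for sanity checks.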