Yifan-Song793 / ETO

Trial and Error: Exploration-Based Trajectory Optimization of LLM Agents (ACL 2024 Main Conference)
https://arxiv.org/abs/2403.02502

Issue About Constructing Preference Data (Webshop) #7

Closed xuehui1991 closed 1 month ago

xuehui1991 commented 3 months ago

Hello Team,

I am currently working with the "construct preference" script and have a couple of questions I hope you can clarify.

  1. pred_traj_offset Setting: I noticed that for webshop data, the pred_traj_offset is set to 10. Could you please explain the reasoning behind this specific value?

  2. Reject Data Construction: In the script, when constructing the reject data, I see the following line of code:

    rejected = pred_trajs[key][:len(golden_trajs[key]) + 5*2][1:]

    What is the purpose of this slicing operation, and why is the length of the golden trajectories plus 5*2 used? I am particularly interested in understanding the significance of the 5*2 multiplier and how it contributes to the rejection criteria.

Thank you in advance for your time and assistance. I am looking forward to your insights on these matters.

Yifan-Song793 commented 3 months ago

Hello!

  1. pred_traj_offset refers to the length of the task prompt and in-context example (including action & observation) in the generated trajectory. When constructing preference trajectories, the in-context example is not taken into account.
  2. We add a slicing operation for the rejected trajectory because some failed trajectories can be significantly longer than successful ones (e.g., repeatedly examining the temperature in ScienceWorld). The specific 5*2-step cap is just a heuristic, and we haven't yet run an ablation study on it (see the sketch below).
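
Roughly, the two settings fit together as follows. This is only a minimal sketch with assumed variable names, not the actual construct_preference.py logic:

    # Minimal sketch (assumed names, not the repo's exact code).
    PRED_TRAJ_OFFSET = 10    # length of task prompt + in-context example (actions & observations)
    MAX_EXTRA_STEPS = 5 * 2  # heuristic cap: at most 5 extra (action, observation) pairs

    def build_preference_pair(golden_traj, pred_traj):
        """Return a (chosen, rejected) pair of turn lists for one failed trajectory."""
        # Drop the prompt / in-context example so only the actual task turns remain.
        task_turns = pred_traj[PRED_TRAJ_OFFSET:]
        # Cap the rejected trajectory so it is not far longer than the golden one
        # (e.g., to avoid endless repeated actions in ScienceWorld).
        rejected = task_turns[:len(golden_traj) + MAX_EXTRA_STEPS]
        return golden_traj, rejected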

I'll be adding some clarifying comments in the construct_preference.py script to enhance understanding. Thanks again!

xuehui1991 commented 3 months ago

Thank you for your quick response!

  1. Regarding the first question on setting pred_traj_offset for the WebShop data: if we don't use the in-context examples, could we just set pred_traj_offset to 2 instead of 10? As per the script, pred_traj_offset is used at line 69, which extracts the exploration conversation except for the instruction_part.

  2. For the WebShop dataset, excessively long conversations don't seem to occur very often. I am therefore wondering if we could handle it the same way the other dataset is processed at line 154. That could streamline the process and better fit the characteristics of the WebShop data.

I appreciate your expertise and would love to hear your thoughts on this matter. Thank you.

Yifan-Song793 commented 3 months ago

  1. If you don't use the in-context examples, you can simply set pred_traj_offset to 2. This value corresponds to the length of the task prompt and any ICL examples before the actual task starts.
  2. In the case of WebShop, you can safely omit the slicing operation; we only use it for ScienceWorld in our experiments (see the sketch below).
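
For WebShop, the construction could then reduce to something like this (a rough sketch with assumed names, not the repo's exact code):

    # Rough sketch of the WebShop case (assumed names).
    PRED_TRAJ_OFFSET = 2  # only the task prompt, since no in-context example is used

    def build_webshop_pair(golden_traj, pred_traj):
        chosen = golden_traj
        rejected = pred_traj[PRED_TRAJ_OFFSET:]  # no extra length cap for WebShop
        return chosen, rejected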

Feel free to reach out with any questions!

xuehui1991 commented 3 months ago

Thanks @Yifan-Song793, I truly appreciate your response.

The issue I've run into is that I can't reproduce the results shown in Figure 4 of the paper.

My current experimental results show the following:

After SFT, the model's performance on the test set is avg reward=0.6392. After SFT + 1 iteration of ETO, the performance drops slightly below the SFT result. However, Figure 4 and Table 2 in the paper suggest that after SFT plus one iteration of ETO, the expected result should be avg reward≈0.67.

To facilitate a more accurate reproduction of your original results, I was wondering if you could kindly share the preference data (e.g., webshop_pm_webshop-sft_1.json) that was constructed after the exploration phase using the SFT-trained Llama2-7b model on WebShop?

Thank you very much for considering my request.

xuehui1991 commented 3 months ago

@Yifan-Song793

sharptcode commented 2 months ago

@xuehui1991 May I ask what hyperparameters you used for SFT training? After the default SFT training with llama2-7b-chat, the performance on WebShop is avg reward=0.57, well below the 0.6392 you got and the result reported in the paper. The settings I used were:

    batch_size=64
    num_train_epochs=3
    learning_rate=2e-5
    weight_decay=0.

xuehui1991 commented 2 months ago

Apologies for the delayed reply. I used the parameters in run_eto.sh, which are the same SFT training parameters you mention, and I was able to reproduce the SFT result from the paper.

Regarding the ETO outcome (SFT + rollout data + DPO training), I found the issue was related to a recent version of transformers. In [train_dpo.py](https://github.com/Yifan-Song793/ETO/blob/main/fastchat/train/train_dpo.py#L281), specifically lines 281 and 289, if the dtype parameter isn't provided during model initialization, it can lead to unpredictable behavior during training. This doesn't trigger an error, only a warning. I've addressed this and am now able to reproduce the results. Thank you for your assistance.
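
For reference, a minimal illustration of the kind of fix involved (assumed model path and dtype; the actual loading code in train_dpo.py may differ):

    # Minimal illustration: pass the dtype explicitly when loading the models for DPO
    # training, instead of relying on the library default (which only emits a warning).
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "path/to/webshop-sft-llama2-7b",   # assumption: path to the SFT checkpoint
        torch_dtype=torch.bfloat16,        # set explicitly to avoid unpredictable behavior
    )
    ref_model = AutoModelForCausalLM.from_pretrained(
        "path/to/webshop-sft-llama2-7b",
        torch_dtype=torch.bfloat16,
    )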