Closed by xuehui1991 1 month ago
Hello!

`pred_traj_offset` refers to the length of the task prompt and in-context example (including actions & observations) in the generated trajectory. When constructing preference trajectories, the in-context example is not taken into account.

The `5*2` steps is just a heuristic strategy; we haven't yet conducted an ablation study on it. I'll add some clarifying comments to the `construct_preference.py` script to make this easier to follow. Thanks again!
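In case it helps other readers, here is a minimal sketch of how I read that heuristic (the function and variable names below are my own, not the ones in `construct_preference.py`): a rejected trajectory is capped at the length of the golden trajectory plus `5*2` extra messages, i.e. roughly five additional action/observation pairs.

```python
def truncate_rejected(rejected_traj, golden_traj, extra_steps=5):
    """Hypothetical sketch of the rejection heuristic: cap the rejected
    trajectory at len(golden) + extra_steps * 2 messages, where each
    step contributes two messages (an action and an observation)."""
    max_len = len(golden_traj) + extra_steps * 2
    return rejected_traj[:max_len]

# Toy example: a 4-message golden trajectory and a 20-message failure.
golden = ["action", "obs"] * 2
rejected = ["action", "obs"] * 10
truncated = truncate_rejected(rejected, golden)
print(len(truncated))  # 4 + 5*2 = 14
```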
Thank you for your quick response!

Regarding the first question on setting `pred_traj_offset` for the WebShop data: if we don't use the in-context examples, could we just set `pred_traj_offset` to 2 instead of 10? As per the script, `pred_traj_offset` is used at line 69, which extracts everything from the exploration conversation except the instruction part.

For the WebShop dataset, excessively long conversations don't seem to occur often. I am therefore wondering whether we could handle it the same way the other dataset is processed at line 154. That could streamline the process and align with the characteristics of the WebShop data.

I appreciate your expertise and would love to hear your thoughts on this matter. Thank you.
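For what it's worth, my mental model of the offset (with made-up variable names, and a guessed 2 + 8 decomposition of the value 10, neither taken from the actual code): each conversation message is one list element, so the offset skips the instruction prefix before the real task trajectory begins.

```python
# Hypothetical sketch: the exploration conversation is a flat list of
# messages, and pred_traj_offset skips the instruction prefix.
conv_with_icl = (
    ["system_prompt", "task_prompt"]   # 2 instruction messages (assumed)
    + ["icl_action", "icl_obs"] * 4    # 8 in-context example messages (assumed)
    + ["action", "obs"] * 3            # the actual task trajectory
)

pred_traj_offset = 10                  # 2 + 8: prompt plus ICL example
generated = conv_with_icl[pred_traj_offset:]
print(len(generated))                  # 6: only the real task messages remain
```

Under this reading, dropping the ICL example would leave only the two instruction messages, which is why an offset of 2 seems natural.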
Yes, you can set `pred_traj_offset` to 2. This value is determined by the length of the task prompt and ICL examples that precede the actual task. Feel free to reach out with any questions!
Thanks @Yifan-Song793, I truly appreciate your response.

The problem I ran into is that I can't reproduce the results shown in Figure 4 of the paper. My current experimental results are as follows: after SFT, the model's performance on the test set is avg reward = 0.6392. After SFT + 1 iteration of ETO, the performance drops slightly from the SFT baseline. However, Figure 4 and Table 2 in the paper suggest that after SFT plus one iteration of ETO, the expected result should be avg reward ≈ 0.67.

To facilitate a more accurate reproduction of your original results, could you kindly share the preference data (e.g., `webshop_pm_webshop-sft_1.json`) constructed after the exploration phase using the SFT-tuned Llama2-7b model on WebShop?

Thank you very much for considering my request.
@Yifan-Song793
@xuehui1991 May I ask what hyperparameters you used for SFT training? After the default SFT training with llama2-7b-chat, the performance on WebShop is avg reward = 0.57, well below the 0.6392 you got and what the paper reports.
batch_size=64
num_train_epochs=3
learning_rate=2e-5
weight_decay=0.
Apologies for the delayed reply. I used the parameters in run_eto.sh, i.e., the same SFT parameters you mention, and I can reproduce the SFT result from the paper.

Regarding the ETO outcome (SFT + rollout data + DPO training), I traced the issue to a recent version of transformers. In [train_dpo.py](https://github.com/Yifan-Song793/ETO/blob/main/fastchat/train/train_dpo.py#L281), specifically lines 281 and 289, if the dtype parameter isn't provided during model initialization, it can lead to unpredictable behavior during training. This doesn't trigger an error, only a warning. I've addressed this and am now able to reproduce the results. Thank you for your assistance.
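For anyone hitting the same warning, a sketch of the kind of fix I mean (not the exact patch, and the checkpoint name is only illustrative): pass the dtype explicitly when loading the model, so recent transformers versions don't fall back to an unintended default.

```python
import torch
from transformers import AutoModelForCausalLM

# Pin the dtype explicitly at model initialization; without this,
# recent transformers versions may pick a default dtype that silently
# degrades DPO training (a warning is emitted, but no error).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",   # illustrative checkpoint
    torch_dtype=torch.bfloat16,
)
```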
Hello Team,

I am currently working with the construct-preference script and have a couple of questions that I believe would benefit from clarification.

1. `pred_traj_offset` setting: I noticed that for the WebShop data, `pred_traj_offset` is set to 10. Could you please explain the reasoning behind this specific value?
2. Reject data construction: When constructing the rejected data, the script slices trajectories at the length of the golden trajectory plus `5*2`. What is the purpose of this slicing operation, and why is `5*2` used? I am particularly interested in understanding the significance of the `5*2` term and how it contributes to the rejection criteria.

Thank you in advance for your time and assistance. I am looking forward to your insights on these matters.