Thanks for your interest. We have sometimes run into such issues with poor hyperparameters. To stabilize training, it might be a good idea to decrease the learning rate for the actor (we used 1e-5 for the gpt2 actor, so I imagine that might be too large for llama2 7b) and to increase the rollout_size (in general, the larger the rollout_size, the more stable the algorithm). The sign of a well-chosen set of hyperparameters is that the q1_mean and q2_mean metrics improve stably.
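For concreteness, here is a minimal sketch of those two changes against the config quoted below; the specific values (lm_lr of 5e-6, rollout_size of 1024) are illustrative guesses, not tuned recommendations:

```yaml
# Hypothetical starting point, not a tuned recommendation:
rollout_size: 1024  # was 512; larger rollouts generally give more stable updates
critic_lr: 2e-5     # unchanged
lm_lr: 5e-6         # was 1e-5 (the gpt2 value); a 7B actor likely needs a smaller LR
```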
Hello, sorry to bother you. We used this code to train on webshop, and the results keep getting worse. We first trained a LoRA on 2000 webshop examples, then trained llama2-7B starting from that LoRA. Our evaluation method: we test on 200 webshop dialogues, with an n-gram metric (n=2). The initial LoRA scores 129.177; after 2000 training iterations the score drops to 68.546. The training config is:

```yaml
# Adversarial Attack Config
defaults:
  - env

env_name: webshop
# uses webshop queries from webshop_lower to webshop_upper
webshop_lower: 2000
webshop_upper: 2100

# training hyperparameters
rollout_size: 512    # number of rollout trajectories for each update
batch_size: 8
iterations: 2000     # total number of iterations
epochs: 50           # number of epochs for the critic each iteration
actor_epochs: 3      # number of epochs for the actor each iteration
warmup_iter: 20      # number of iterations without updating the policy
grad_accum_steps: 32
critic_lr: 2e-5
lm_lr: 1e-5
```

In particular, by iteration 2000 our pi_action outputs are badly malformed:

```
pi_action = ["Action:search[men's under 50 dollars long sleeve t-shirts {machine wash {3 { {color: under", "[ { machine wash men's { { { {[15/1776] { { { small { { { { { size: { { { { under {", '{ ', "Action:search[machine { { { size men's large { under 50 dollars { { x-large { { color infer no { { {", 'Action:click[size:medium x-large under 60 dollars machine { " { { " x-small { men\'s medium 1', 'Action:search[easy clean 27 inch un der 50 dollars]', "Action:search[women size x-large color under {x-large {black {men's machine {under {80 dollars button {fit", 'Action:click[nav y 5x x-large under 50 dollars]']
```

So we would like to ask:

1. When you ran RL training with [Mistral-7B], did you first do SFT on the model?
2. When you trained with llama2, did you do SFT as well? Were llama2's final outputs also poor? Are there other points we may have overlooked?
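As a quick diagnostic for the degeneration shown above, it can help to track what fraction of sampled actions still parse as well-formed webshop commands across iterations. Below is a minimal sketch in Python, assuming valid actions follow the `Action:search[...]` / `Action:click[...]` pattern visible in the healthy outputs; the regex and function name here are hypothetical, not part of this repo:

```python
import re

# Well-formed webshop actions look like "Action:search[...]" or
# "Action:click[...]" with one balanced bracket pair and no stray "{" tokens.
ACTION_RE = re.compile(r"^Action:(search|click)\[[^\[\]{}]+\]$")

def valid_action_fraction(pi_actions):
    """Fraction of sampled actions that are well-formed; a steady drop in
    this number over iterations is an early sign of policy collapse."""
    return sum(bool(ACTION_RE.match(a.strip())) for a in pi_actions) / len(pi_actions)

pi_action = [
    "Action:search[easy clean 27 inch un der 50 dollars]",  # structurally valid
    "Action:click[nav y 5x x-large under 50 dollars]",      # structurally valid
    "[ { machine wash men's { { {",                          # degenerate
]
print(valid_action_fraction(pi_action))  # 0.666...
```

A steady decline in this fraction alongside falling q1_mean/q2_mean would be consistent with the actor learning rate being too high, as suggested above.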