Thanks for your interest. We have sometimes run into such issues with poor hyperparameters. To stabilize training, it might be a good idea to decrease the learning rate for the actor (we used 1e-5 for the gpt2 actor, so I imagine that might be too large for llama2 7b) and to increase the rollout_size (in general, the larger the rollout_size, the more stable the algorithm). The sign of a well-chosen set of hyperparameters is that the q1_mean and q2_mean metrics improve stably.
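For concreteness, here is a minimal sketch of those two changes against the config quoted below; the specific values (lm_lr of 5e-6, rollout_size of 1024) are illustrative guesses, not tuned recommendations:

```yaml
# Hypothetical starting point, not a tuned recommendation:
rollout_size: 1024  # was 512; larger rollouts generally give more stable updates
critic_lr: 2e-5     # unchanged
lm_lr: 5e-6         # was 1e-5 (the gpt2 value); a 7B actor likely needs a smaller LR
```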
Hello, sorry to bother you. We used this code to train on webshop, and the results keep getting worse. We first trained a LoRA on 2000 webshop examples, then trained llama2-7B starting from that LoRA. Our evaluation method: we test on 200 webshop dialogues, with an n-gram metric (n=2). The initial LoRA scores 129.177; after 2000 training iterations the score drops to 68.546. The training config is:

```yaml
# Adversarial Attack Config
defaults:
  - env

env_name: webshop
# uses webshop queries from webshop_lower to webshop_upper
webshop_lower: 2000
webshop_upper: 2100

# training hyperparameters
rollout_size: 512    # number of rollout trajectories for each update
batch_size: 8
iterations: 2000     # total number of iterations
epochs: 50           # number of epochs for the critic each iteration
actor_epochs: 3      # number of epochs for the actor each iteration
warmup_iter: 20      # number of iterations without updating the policy
grad_accum_steps: 32
critic_lr: 2e-5
lm_lr: 1e-5
```

In particular, by iteration 2000 our pi_action outputs are badly malformed:

```
pi_action = ["Action:search[men's under 50 dollars long sleeve t-shirts {machine wash {3 { {color: under", "[ { machine wash men's { { { {[15/1776] { { { small { { { { { size: { { { { under {", '{ ', "Action:search[machine { { { size men's large { under 50 dollars { { x-large { { color infer no { { {", 'Action:click[size:medium x-large under 60 dollars machine { " { { " x-small { men\'s medium 1', 'Action:search[easy clean 27 inch un der 50 dollars]', "Action:search[women size x-large color under {x-large {black {men's machine {under {80 dollars button {fit", 'Action:click[nav y 5x x-large under 50 dollars]']
```

So we would like to ask:

1. When you ran RL training with [Mistral-7B], did you first do SFT on the model?
2. When you trained with llama2, did you do SFT as well? Were llama2's final outputs also poor? Are there other points we may have overlooked?
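As a quick diagnostic for the degeneration shown above, it can help to track what fraction of sampled actions still parse as well-formed webshop commands across iterations. Below is a minimal sketch in Python, assuming valid actions follow the `Action:search[...]` / `Action:click[...]` pattern visible in the healthy outputs; the regex and function name here are hypothetical, not part of this repo:

```python
import re

# Well-formed webshop actions look like "Action:search[...]" or
# "Action:click[...]" with one balanced bracket pair and no stray "{" tokens.
ACTION_RE = re.compile(r"^Action:(search|click)\[[^\[\]{}]+\]$")

def valid_action_fraction(pi_actions):
    """Fraction of sampled actions that are well-formed; a steady drop in
    this number over iterations is an early sign of policy collapse."""
    return sum(bool(ACTION_RE.match(a.strip())) for a in pi_actions) / len(pi_actions)

pi_action = [
    "Action:search[easy clean 27 inch un der 50 dollars]",  # structurally valid
    "Action:click[nav y 5x x-large under 50 dollars]",      # structurally valid
    "[ { machine wash men's { { {",                          # degenerate
]
print(valid_action_fraction(pi_action))  # 0.666...
```

A steady decline in this fraction alongside falling q1_mean/q2_mean would be consistent with the actor learning rate being too high, as suggested above.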