Thanks. Can you provide the prompt used to ask ChatGPT for the reward score, and what is origin_scores in the provided data?
Score different responses separately, without explanation and without copying any input, from these respects; please score starting from Response 1: Relevance (is it relevant to the user's query), Correctness (does it contain correct knowledge), Coherence (is it generated fluently and without grammar problems), Safety (does it refuse to answer sexual or criminal queries), and give a score from 1-5 for each respect. Response 1: ... Response 2: ... Response 3: ...
@GanjinZero Does the score range affect the final RRHF result? For example, a score range of 1~100 or -10~10.
We have not done extensive experiments on this. It can affect the performance.
So what range of scores do you recommend using? I notice the scores in the data you provide are mostly around -2~0, but the scores from ChatGPT are around 10~20.
I think you misunderstand something. The score is generated by a reward model. For the data I provided with scores around -2~0, the score is calculated by Dahoas/gptj-rm-static; for the data with scores around 10~20, the score is calculated by ChatGPT. These two scores come from different datasets and different scoring criteria. If you do not have a specific purpose, I recommend using the ChatGPT score.
Where should I put the query?
Before response 1
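For illustration, here is a rough sketch of assembling that prompt with the query placed before Response 1 and sending it to ChatGPT. The openai client usage, the model name, and the helper function are my own assumptions and are not from the repository.

```python
# Hedged sketch: build the scoring prompt (query before Response 1) and send it
# to ChatGPT; adjust the model name and client setup to your own environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score different responses separately, without explanation and without copying "
    "any input, from these respects; please score starting from Response 1: "
    "Relevance (is it relevant to the user's query), Correctness (does it contain "
    "correct knowledge), Coherence (is it generated fluently and without grammar "
    "problems), Safety (does it refuse to answer sexual or criminal queries), and "
    "give a score from 1-5 for each respect."
)

def score_responses(query: str, responses: list[str]) -> str:
    # The query goes before Response 1, as suggested above.
    prompt = (
        RUBRIC
        + f"\nQuery: {query}\n"
        + "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; use whichever ChatGPT model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```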
We have not done extensive experiments on this. It can affect the performance.
Why do you say the score range would affect the results? Looking at the rrhf_loss computation code, it seems the loss only depends on the relative order of the candidate responses' scores, not on the specific score range.
Because your score range may affect how well ChatGPT scores.
OK, understood. Thanks!
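To make the point above concrete, here is a minimal sketch (generic PyTorch, not the repository's exact rrhf_loss code) of a ranking loss of the RRHF form. The reward scores are used only to decide which response should out-rank which, so rescaling them (1~100 vs. -10~10) leaves the loss unchanged; the range only matters because it can change how ChatGPT assigns scores in the first place.

```python
# Minimal sketch of an RRHF-style ranking loss: only the pairwise order of the
# reward scores enters the computation, not their absolute values.
import torch

def rrhf_rank_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """seq_logprobs: (k,) length-normalized log-probs of k candidate responses.
    rewards:        (k,) reward scores for the same candidates (any scale)."""
    diff = seq_logprobs.unsqueeze(1) - seq_logprobs.unsqueeze(0)   # diff[i, j] = p_i - p_j
    worse = (rewards.unsqueeze(1) < rewards.unsqueeze(0)).float()  # 1 where reward_i < reward_j
    # Penalize pairs where the lower-reward response gets the higher likelihood.
    return (diff.clamp(min=0) * worse).sum()

lp = torch.tensor([-1.2, -0.8, -2.0])
print(rrhf_rank_loss(lp, torch.tensor([1.0, 5.0, 3.0])))     # same ordering ...
print(rrhf_rank_loss(lp, torch.tensor([-10.0, 10.0, 0.0])))  # ... same loss value
```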
I downloaded the data you provided from Google Drive and trained GPT-2 with 8 V100s, but I got the error below in the middle of training. It happens every time at the 339th data sample.
Traceback (most recent call last):
File "train.py", line 319, in
train()
File "train.py", line 313, in train
trainer.train()
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 2759, in training_step
loss = self.compute_loss(model, inputs)
File "train.py", line 271, in compute_loss
logits = model(input_ids=inputs.get('input_ids'), attention_mask=inputs.get('attention_mask'))[0] # (batch cand) L V
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1080, in forward
transformer_outputs = self.transformer(
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 903, in forward
outputs = block(
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 391, in forward
attn_outputs = self.attn(
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 332, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 202, in _attn
mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA
to enable device-side assertions.
I'm not sure; you may (1) delete the 339th data sample, or (2) check the length of the 339th data sample against the max length you set.
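If you want to check the length first, here is a rough sketch; the field names ("query", "responses") and the one-JSON-object-per-line assumption are guesses, so adjust them to the actual format of the provided file.

```python
# Hedged sketch: inspect the token lengths of the 339th record before training.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

with open("alpaca_responses_hh.json") as f:
    records = [json.loads(line) for line in f]  # assumes one JSON object per line

rec = records[338]  # the 339th record, 0-indexed
query_len = len(tokenizer(rec["query"]).input_ids)
for i, resp in enumerate(rec["responses"]):
    total = query_len + len(tokenizer(resp).input_ids)
    print(f"response {i}: query + response tokens = {total}")
```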
Thanks for the explanation!
I'm training RRHF based on gpt2-large using 8 V100 32G GPUs, but got a "CUDA out of memory" error. Here is my bash script:
export MODEL_PATH='gpt2-large'
export SAVE_PATH='/newvolume/save_model/rrhf_gpt2_large'
export MASTER_ADDR="localhost"
export MASTER_PORT="7000"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline

cd ./RRHF-main
python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=8 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path ~/dehong/data/rrhf_data/alpaca_responses_hh.json \
    --bf16 False \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'GPT2Block' \
    --tf32 False --model_max_length 192 --rrhf_weight 1
With FSDP, I feel this model (<1B parameters) should be easy to train on 8 V100s, and I'm not sure where the problem is.
I attach the error message here:
The actual batch size is 1 * (number of candidate responses per query), so this may cost more memory than you think. You should try a smaller max_length or reduce the number of responses per query in your own data.
I see, thanks for the reply. I understand that you have to input all the responses during training, but I'm still confused about why you need to train multiple queries each time.
We input [query] + [response 1], ..., [query] + [response k] into LLaMA each time. We repeatedly calculate the query representation because this requires minimal changes to the existing code.
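To illustrate what this means for memory, here is a rough sketch (not the repository's collator; the tokenizer and the way the texts are joined are assumptions) of how a single query with k candidate responses turns into a batch of k sequences even though per_device_train_batch_size is 1.

```python
# Hedged sketch: one training example expands into k sequences, each repeating
# the query tokens in front of a different candidate response.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in tokenizer
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default

query = "How do I boil an egg?"
responses = [
    "Put the egg in boiling water for about seven minutes.",
    "You should never eat eggs.",
    "Simmer gently, then cool under cold water.",
]

batch_texts = [query + " " + r for r in responses]
batch = tokenizer(batch_texts, padding=True, return_tensors="pt")
print(batch["input_ids"].shape)  # (k, seq_len): memory grows with k and seq_len
```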
If I understand correctly, technically I could use just 2 inputs, [query] + [response 1] and [query] + [response 2], to train the model, since both the ranking and SFT losses only need pairwise data. Using k inputs here is just for convenience. Am I right?
You can use only 2 responses. But with many responses, the SFT loss is different from the 2-response case, since the SFT loss is applied to the highest-scoring response among the candidates, which can change when more candidates are included.
Got it! Now I'm using 8 A100 80G GPUs to fine-tune Alpaca-7B with the provided HH paired responses, but I ran into a CUDA OOM problem. Training goes well for the first few hundred steps, but I always hit an OOM error in the middle of training. I'm not sure where the problem is, since I'm using the smallest amount of data and almost every parallelism method to save memory.
Here is my script:

export MODEL_PATH='alpaca-7b'
export SAVE_PATH='save_model/rrhf-alpaca-7b'
export MASTER_ADDR="localhost"
export MASTER_PORT="7000"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline

python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=8 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path data/alpaca_responses_hh.json \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --only_use_provide True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True --model_max_length 192 --rrhf_weight 1
A possibility is that your query is too long; model_max_length seems to truncate only the response, not the query, in my implementation.
model_max_length also truncates the query, in _single_tokenize():

if max_len is None:
    max_len = tokenizer.model_max_length

However, I still encountered OOM when running a 7B model on 8 A100s.
I think the answer is yes, but I believe you have to use a distributed training framework such as DeepSpeed ZeRO-3. Since our code is based on Hugging Face's Trainer, I think such integration would be simple. You may follow the guide here: https://huggingface.co/docs/transformers/main_classes/deepspeed
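For example, here is a hedged sketch of a ZeRO-3 config that could be passed to train.py through the Trainer's --deepspeed flag; the specific values are assumptions, and you would drop the --fsdp flags, since FSDP and DeepSpeed cannot be combined in one run.

```python
# Hedged sketch: write a DeepSpeed ZeRO-3 config with CPU offload. Fields set
# to "auto" are filled in by the Hugging Face Trainer from its own arguments.
import json

ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero3.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Launch roughly as before, replacing the FSDP flags with, e.g.:
#   deepspeed --num_gpus=8 train.py ... --deepspeed ds_config_zero3.json
```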