GanjinZero / RRHF

[NIPS2023] RRHF & Wombat

can RRHF train on v100 32G? #20

Closed akk-123 closed 1 year ago

Yuanhy1997 commented 1 year ago

I think the answer is yes, but you will likely need a distributed training framework such as DeepSpeed ZeRO-3. Since our code is based on Hugging Face's Trainer, such an integration should be simple. You may follow the guide at this link: https://huggingface.co/docs/transformers/main_classes/deepspeed
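
For example, a minimal sketch of wiring ZeRO-3 into a Trainer-based script like train.py (the output path and config file name below are placeholders, not files from this repo):

```python
# Minimal sketch: point Hugging Face's TrainingArguments at a DeepSpeed ZeRO-3 config.
# "ds_config_zero3.json" is a hypothetical config file (ZeRO stage 3, optionally with
# CPU offload); adjust batch size / accumulation to fit 32G V100s.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./rrhf_out",            # placeholder output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,                          # V100 has no bf16 support, so use fp16
    deepspeed="ds_config_zero3.json",   # hands sharding/offload over to DeepSpeed
)
```

If train.py parses its arguments with HfArgumentParser into TrainingArguments (as Alpaca-style scripts do), passing --deepspeed ds_config_zero3.json on the command line should have the same effect.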

akk-123 commented 1 year ago

Thanks. Can you provide the prompt used to ask ChatGPT for the reward score? And what is origin_scores in the provided data? [screenshot of the data]

GanjinZero commented 1 year ago

Score different responses separately without explanation and without copying any input from these respects, please scores start from response 1: Relevance (does it relevant to user's query), Correctness (does it contain correct knowledge), Coherence (does it generate fluently and without grammar problems), Safety (does it refuse to answer sex or criminal queries) and give a score for each respect 1-5. Response 1: ... Response 2: ... Response 3: ...

akk-123 commented 1 year ago

@GanjinZero Does the score range affect the final RRHF result? For example, a score range of 1~100 vs. -10~10.

GanjinZero commented 1 year ago

We did not do extensive experiments on this. It can affect the performance.

akk-123 commented 1 year ago

So what range of scores do you recommend using? I notice that the scores in the data you provide are mostly in -2~0, but the scores given by ChatGPT are in 10~20.

GanjinZero commented 1 year ago

I think you misunderstand something. The score is generated by a reward model. For the data I provided with scores in -2~0, the score is calculated by Dahoas/gptj-rm-static; for the data with scores in 10~20, the score is calculated by ChatGPT. These two scores come from different datasets and different scoring criteria. If you do not have a specific purpose, I recommend using the ChatGPT scores.

dyyzhmm commented 1 year ago

Score different responses separately without explanation and without copying any input from these respects, please scores start from response 1: Relevance (does it relevant to user's query), Correctness (does it contain correct knowledge), Coherence (does it generate fluently and without grammar problems), Safety (does it refuse to answer sex or criminal queries) and give a score for each respect 1-5. Response 1: ... Response 2: ... Response 3: ...

Where should I put the query?

GanjinZero commented 1 year ago

Where should I put the query?

Put the query before Response 1.
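
For anyone following along, a minimal sketch of assembling that prompt with the query placed before Response 1 (the `Query:` label and the function name are my own choices, not prescribed by the authors):

```python
# Hypothetical helper for building the ChatGPT scoring prompt described above.
SCORING_INSTRUCTION = (
    "Score different responses separately without explanation and without copying any "
    "input from these respects, please scores start from response 1: Relevance (does it "
    "relevant to user's query), Correctness (does it contain correct knowledge), Coherence "
    "(does it generate fluently and without grammar problems), Safety (does it refuse to "
    "answer sex or criminal queries) and give a score for each respect 1-5."
)

def build_scoring_prompt(query: str, responses: list[str]) -> str:
    numbered = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    # The query goes before "Response 1", per the author's answer above.
    return f"{SCORING_INSTRUCTION}\nQuery: {query}\n{numbered}"
```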

SuMeng123 commented 1 year ago

We did not do extensive experiments on this. It can affect the performance.

Why would the score range affect the result? Looking at the rrhf_loss computation code, the loss seems to depend only on the relative ordering of the candidate responses' scores, not on the absolute score range.
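
(For reference, here is a minimal sketch of a rank loss of this form, written from the paper's description rather than copied from train.py; it illustrates that only the relative order of the scores enters the loss.)

```python
import torch

def rrhf_rank_loss(logprobs: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # logprobs: (k,) length-normalized log-probs of k candidate responses under the policy
    # scores:   (k,) reward scores of those candidates
    diff_p = logprobs.unsqueeze(1) - logprobs.unsqueeze(0)  # [i, j] = p_i - p_j
    worse = scores.unsqueeze(1) < scores.unsqueeze(0)       # [i, j] = (score_i < score_j)
    return torch.clamp(diff_p, min=0.0)[worse].sum()        # sum of max(0, p_i - p_j) over ranked pairs

p = torch.tensor([-1.2, -0.7, -2.0])
s = torch.tensor([10.0, 20.0, 15.0])
# Rescaling or shifting the scores leaves this loss unchanged, since only the ordering matters:
assert torch.allclose(rrhf_rank_loss(p, s), rrhf_rank_loss(p, (s - 10.0) / 5.0))
```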

GanjinZero commented 1 year ago

Because the score range you ask for may affect how well ChatGPT scores the responses.

SuMeng123 commented 1 year ago

Because the score range you ask for may affect how well ChatGPT scores the responses.

OK, understood. Thanks!

DehongXu commented 1 year ago

I downloaded the data you provided from Google Drive and trained GPT-2 with 8 V100s, but I got the error below in the middle of training. It happens every time at the 339th data sample.

```
Traceback (most recent call last):
  File "train.py", line 319, in <module>
    train()
  File "train.py", line 313, in train
    trainer.train()
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "train.py", line 271, in compute_loss
    logits = model(input_ids=inputs.get('input_ids'), attention_mask=inputs.get('attention_mask'))[0]  # (batch cand) L V
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 748, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1080, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 903, in forward
    outputs = block(
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 391, in forward
    attn_outputs = self.attn(
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 332, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/opt/conda/envs/rrhf/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 202, in _attn
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

GanjinZero commented 1 year ago

I'm not sure. You could: (1) delete the 339th data sample, or (2) check the length of the 339th sample against the max length you set.
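
For suggestion (2), a minimal sketch of checking that sample's token length; the file name comes from this thread, while the JSON structure and the "query"/"responses" field names are assumptions about the provided data format:

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
with open("alpaca_responses_hh.json") as f:   # data file mentioned in this thread
    data = json.load(f)

sample = data[338]                            # the 339th sample, 0-indexed
for response in sample["responses"]:          # field names are assumed; adjust to your data
    print(len(tokenizer(sample["query"] + response)["input_ids"]))
```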

DehongXu commented 1 year ago

Thanks for the explanation!

I'm training RRHF on gpt2-large using 8 V100 32G, but got a "CUDA out of memory" error. Here is my bash script:

```bash
export MODEL_PATH='gpt2-large'
export SAVE_PATH='/newvolume/save_model/rrhf_gpt2_large'
export MASTER_ADDR="localhost"
export MASTER_PORT="7000"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline

cd ./RRHF-main
python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=8 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path ~/dehong/data/rrhf_data/alpaca_responses_hh.json \
    --bf16 False \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'GPT2Block' \
    --tf32 False --model_max_length 192 --rrhf_weight 1
```

With FSDP, I feel a model of this size (<1B parameters) should train easily on 8 V100s, and I'm not sure where the problem is.

DehongXu commented 1 year ago

I attach the error message here: [screenshot of the error]

GanjinZero commented 1 year ago

The actual batch size is 1 * query_count, so this may cost more memory than you expect. You should try a smaller max_length or reduce the query count in your own data.

DehongXu commented 1 year ago

I see, thanks for the reply. I understand that you have to input all responses during training, but I'm still confused about why you need to train multiple queries each time.

GanjinZero commented 1 year ago

I see, thanks for the reply. I understand that you have to input all responses during training, but I'm still confused about why you need to train multiple queries each time.

We input [query] + [response 1], ..., [query] + [response k] into LLaMA each time. We repeatedly compute the query representation because this requires minimal changes to the existing code.
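
A minimal sketch of what that means in code (an assumed illustration, not the repo's actual collator); it also shows why the effective batch size is per_device_train_batch_size * k:

```python
from transformers import PreTrainedTokenizer

def build_candidate_batch(tokenizer: PreTrainedTokenizer, query: str,
                          responses: list[str], max_len: int = 192):
    # Assumes tokenizer.pad_token is set (e.g. to the eos token for LLaMA/GPT-2).
    texts = [query + r for r in responses]        # [query] + [response i] for each candidate
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_len, return_tensors="pt")
    # All k sequences go through the model at once, so memory grows linearly with k.
    return enc["input_ids"], enc["attention_mask"]
```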

DehongXu commented 1 year ago

If I understand correctly, technically I could train the model with just 2 inputs, [query] + [response 1] and [query] + [response 2], since both the ranking and SFT losses only need pairwise data. Using k inputs here is just for convenience. Am I right?

GanjinZero commented 1 year ago

You can use only 2 responses, but with many responses the SFT loss is different from the 2-response case, since it is computed on the highest-scored candidate among all responses.
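
A minimal sketch of that SFT term, written from the paper's description (not copied from train.py):

```python
import torch

def rrhf_sft_loss(token_logprobs: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # token_logprobs: (k, L) per-token log-probs of the k candidates (padding positions zeroed)
    # scores:         (k,)   reward scores of those candidates
    best = torch.argmax(scores)          # with more candidates, the best-scored one may change
    return -token_logprobs[best].sum()   # cross-entropy on the highest-scored candidate only
```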

DehongXu commented 1 year ago

Got it! Now I'm using 8 A100 80G to fine-tune Alpaca-7B with the provided HH paired responses, but got a CUDA OOM problem. Training went well for the first few hundred steps, but I always get an OOM error in the middle of training. I'm not sure where the problem is, since I'm using the smallest amount of data and nearly every parallelism method available to save memory.

Here is my script:

```bash
export MODEL_PATH='alpaca-7b'
export SAVE_PATH='save_model/rrhf-alpaca-7b'
export MASTER_ADDR="localhost"
export MASTER_PORT="7000"
export GLOO_SOCKET_IFNAME="lo"
export NCCL_SOCKET_IFNAME="lo"
export WANDB_DISABLED=true
wandb offline

python3 -m torch.distributed.launch --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    --nproc_per_node=8 --use_env train.py \
    --model_name_or_path $MODEL_PATH \
    --data_path data/alpaca_responses_hh.json \
    --bf16 True \
    --output_dir $SAVE_PATH \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 40 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --only_use_provide True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True --model_max_length 192 --rrhf_weight 1
```

GanjinZero commented 1 year ago

One possibility is that your query is too long; in my implementation, model_max_length seems to truncate only the response but not the query.

shoyua commented 1 year ago

model_max_length also truncates the query, in _single_tokenize():

```python
if max_len is None:
    max_len = tokenizer.model_max_length
```

However, I still encountered OOM when running a 7B model on 8 A100s.