OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Could you give an example of testing deepspeed-chat time? #327

Closed: youngyoung321 closed this issue 1 week ago

youngyoung321 commented 2 weeks ago

I've recently been learning the RLHF module. I saw the speed comparison table for DeepSpeed-Chat in your arXiv paper, where you mention having fixed the hybrid engine. I'm curious how you did that, and whether you can share the code or open a PR.

hijkzzz commented 2 weeks ago

@wuxibin89 may know how to fix DSChat.

youngyoung321 commented 2 weeks ago

Thanks for your reply. I found the forked version of DeepSpeed-Chat in @wuxibin89's repository, along with a new bash file for Ray and some version-update suggestions. I tried running the Llama-7B experiment on 2 nodes × 8 A800s with a batch size of 1024; although I only have a few rounds of data so far, the results are relatively stable. However, the total end-to-end time differs considerably from the arXiv paper, and the generation time in particular shows a very large gap.

                     | Optimized DSChat (OpenRLHF paper) | DSChat in my experiments | Gap
E2E time (s)         | 855.09                            | ~538                     | ~297
Generation time (s)  | 590.157                           | ~328                     | ~262
Training time (s)    | 125.69                            | ~148                     | ~23

Could you give me some guidance? Here is the configuration I am trying to run; my launch script is below for reference:

ACTOR_MODEL_PATH="OpenLLMAI/Llama-2-7b-sft-model-ocra-500k"
CRITIC_MODEL_PATH="OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194"
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
# Note: Actor_Lr, Critic_Lr, and OUTPUT are set elsewhere in the full script
deepspeed --num_nodes=2 \
   -H hostfile \
   main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 8 \
   --per_device_training_batch_size 8 \
   --generation_batches 8 \
   --ppo_epochs 1 \
   --max_answer_seq_len 1024 \
   --max_prompt_seq_len 1024 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --offload \
   --offload_reference_model \
   --release_inference_cache \
   --gradient_accumulation_steps 8 \
   --actor_gradient_checkpointing \
   --critic_gradient_checkpointing \
   --actor_dropout 0.0 \
   --num_warmup_steps 0 \
   --deepspeed --seed 1234 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --enable_hybrid_engine \
   --output_dir $OUTPUT \
   --inference_tp_size 1
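
For reference, the -H hostfile argument expects a standard DeepSpeed hostfile with one "<hostname> slots=<num_gpus>" line per node. A minimal sketch for this 2-node × 8-GPU setup (hostnames are placeholders):

# hostfile (hostnames below are placeholders)
node1 slots=8
node2 slots=8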

And here is the timing information from my output log:

head: Epoch: 0 | Step: 7 | PPO Epoch: 1 | Actor Loss: 10.360088467597961 | Critic Loss: 54.418667793273926 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 544.72s, TFLOPs: 30.65, Samples/sec: 1.88, Time/seq 0.53s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 41.00s, Per-token Latency 40.03 ms, TFLOPs: 5.66, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.12s
head: Training   => Latency: 151.79s, TFLOPs: 97.79
head: End-to-End => Latency: 544.72s, Real-End-to-End: 544.72s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.0657958984375 | EMA reward score: 0.0657958984375
head: Epoch: 0 | Step: 15 | PPO Epoch: 1 | Actor Loss: 10.356070637702942 | Critic Loss: 54.373966217041016 | Unsupervised Loss: 0.0
head: End-to-End => Latency: 531.29s, TFLOPs: 31.43, Samples/sec: 1.93, Time/seq 0.52s, Batch Size: 1024, Total Seq. Length: 2048
head: Generation => Latency: 39.94s, Per-token Latency 39.00 ms, TFLOPs: 5.81, BW: -1.00 GB/sec, Answer Seq. Length: 1024
head: Preparation => Latency: 8.29s
head: Training   => Latency: 145.45s, TFLOPs: 102.05
head: End-to-End => Latency: 531.29s, Real-End-to-End: 531.29s
head: Actor Model Parameters => 6.738 B, Critic Model Parameters => 6.607 B
head: Average reward score: 0.1011962890625 | EMA reward score: 0.0693359375

Looking forward to your reply. Thanks again for your sharing and suggestions.

hijkzzz commented 2 weeks ago

I didn't conduct this experiment myself, so I am not familiar with the details. However, don't worry: the performance of OpenRLHF reported in the paper is not optimal either, because we didn't enable the --colocate_critic_reward and --colocate_actor_ref options.

I suspect your test results differ from ours because of the datasets: different datasets lead to different input and output lengths, and as training progresses, longer outputs slow down the process.

In addition, we updated the two checkpoints you used last week, which will also lead to different results; see https://huggingface.co/OpenLLMAI/Llama-2-7b-sft-model-ocra-500k/tree/main

==============

If you are interested in performance testing, I recommend ensuring that OpenRLHF and Optimized DSChat have the same input and output lengths/checkpoints. Enable --colocate_critic_reward and --colocate_actor_ref for OpenRLHF, increase the number of vLLM engines, and maximize the micro-batch size as much as possible.
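
To make this concrete, here is a minimal launch sketch with those options enabled, assuming the train_ppo_ray entry point and flag names from the OpenRLHF README (the module path, GPU counts, and batch sizes are illustrative and may vary across versions):

# Hedged sketch: OpenRLHF Ray PPO launch with colocation and vLLM engines enabled.
# Flag names follow the OpenRLHF README; all values here are illustrative.
ray job submit --address="http://127.0.0.1:8265" \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
   --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
   --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
   --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
   --colocate_actor_ref \
   --colocate_critic_reward \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 2 \
   --pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
   --reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
   --prompt_data Dahoas/rm-static \
   --prompt_max_len 1024 \
   --generate_max_len 1024 \
   --micro_train_batch_size 8 \
   --train_batch_size 1024 \
   --micro_rollout_batch_size 16 \
   --rollout_batch_size 1024

Keeping the prompt and generation lengths at 1024 matches the DSChat run above, so the generation workloads stay comparable.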

wuxibin89 commented 2 weeks ago

@youngyoung321 Based on the discussion in https://github.com/microsoft/DeepSpeed/issues/4469, I used this fork for the DeepSpeed-Chat benchmark: https://github.com/garrett4wade/DeepSpeed-for-dschat
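
For anyone reproducing the benchmark, a minimal sketch of swapping in that patched fork before launching the DSChat script (this is just the standard pip install-from-source workflow; nothing fork-specific is assumed):

# Replace the installed DeepSpeed with the patched fork
pip uninstall -y deepspeed
git clone https://github.com/garrett4wade/DeepSpeed-for-dschat.git
pip install ./DeepSpeed-for-dschat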

youngyoung321 commented 2 weeks ago

Thanks again for your reply; I will try your suggestions.

youngyoung321 commented 2 weeks ago

Thanks for your reply and for sharing; I will try this fork.

hijkzzz commented 1 week ago

Performance Tuning Guide: https://github.com/OpenLLMAI/OpenRLHF?tab=readme-ov-file#performance-tuning-guide