karthik19967829 opened 4 months ago
For the 7B model, it seems you can try running without vLLM.
Yup, we want to run 34B+ models; we were only testing the vLLM setup with the 7B model, since it works fine without vLLM.
@karthik19967829 I can't reproduce this problem with your script; my job succeeded as expected. Can you post the Ray job supervisor's log? You can find it at /tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log
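For example (assuming the default Ray temp directory; the raysubmit_ ID is printed when the job is submitted):

# list the driver logs, then print the one for your job
ls /tmp/ray/session_latest/logs/job-driver-raysubmit_*.log
cat /tmp/ray/session_latest/logs/job-driver-raysubmit_<JOBID>.log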
Cool, will do that, thanks. I am using 1 node with 8 GPUs; may I know your exact hardware setup and run command?
My hardware is 1 node with 8 A100 GPUs, and the run command is:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "."}' \
--no-wait \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 1 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 4 \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1 \
--pretrain mistralai/Mistral-7B-v0.1 \
--reward_pretrain mistralai/Mistral-7B-v0.1 \
--save_path /openrlhf/examples/scripts/ckpt/starling_7b \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 16 \
--rollout_batch_size 256 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 2e-7 \
--critic_learning_rate 3e-6 \
--init_kl_coef 0.001 \
--prompt_data Open-Orca/OpenOrca \
--prompt_data_probs 1 \
--max_samples 256 \
--actor_init_on_gpu \
--adam_offload \
--gradient_checkpointing
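For the 34B+ runs mentioned above, a plausible (untested here) adjustment would be to give the vLLM engine more of the node via tensor parallelism, e.g.:

# hypothetical change for a 34B+ model on the same 8-GPU node
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 4 \

with the actor/critic/ref/reward GPU counts rebalanced to fit the remaining GPUs.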
Also, could you share the exact versions of the libraries in your environment (e.g., via pip list)?
Thank you so much for the quick response :) Hope we can build something cool together!
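Something like this would capture the relevant versions (the package filter is just a guess at what matters for this setup):

pip list | grep -Ei 'ray|vllm|deepspeed|transformers|torch|flash'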
@wuxibin89 I'm encountering the same problem, and this is in /tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log:
(openrlhf) root@401e005161f6:/tmp/ray/session_latest/logs# cat job-driver-raysubmit_BqhS5Hp5fuBzecG1.log
[2024-02-17 06:33:00,146] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
2024-02-17 06:33:04,141 INFO worker.py:1405 -- Using address 0.0.0.0:6379 set in the environment variable RAY_ADDRESS
2024-02-17 06:33:04,141 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 0.0.0.0:6379...
2024-02-17 06:33:04,148 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=163692) [2024-02-17 06:33:07,082] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=163692) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(pid=163692) warnings.warn(
***** constructed actor model: {actor_model}
***** constructed critic model: {critic_model}
***** constructed reference model: {ref_model}
***** constructed reward models: {reward_models}
(ActorModelRayActor pid=163692) [2024-02-17 06:33:14,446] [INFO] [comm.py:637:init_distributed] cdb=None
(ActorModelRayActor pid=163692) [2024-02-17 06:33:14,446] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=163889) [2024-02-17 06:33:11,717] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
***** constructed vLLM engines: {vllm_engines}
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
(pid=163888) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations [repeated 2x across cluster]
(pid=163888) warnings.warn( [repeated 2x across cluster]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:01<00:03, 1.72s/it]
(RewardModelRayActor pid=164150) INFO 02-17 06:33:19 model.py:190] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(RewardModelRayActor pid=164150) INFO 02-17 06:33:19 model.py:190] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(RewardModelRayActor pid=164150) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.52s/it]
(ActorModelRayActor pid=163692) Actor(
(ActorModelRayActor pid=163692) (model): LlamaForCausalLM(
(ActorModelRayActor pid=163692) (model): LlamaModel(
(ActorModelRayActor pid=163692) (embed_tokens): Embedding(32000, 4096)
(ActorModelRayActor pid=163692) (layers): ModuleList(
(ActorModelRayActor pid=163692) (0-31): 32 x LlamaDecoderLayer(
(ActorModelRayActor pid=163692) (self_attn): LlamaFlashAttention2(
(ActorModelRayActor pid=163692) (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (rotary_emb): LlamaRotaryEmbedding()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (mlp): LlamaMLP(
(ActorModelRayActor pid=163692) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(ActorModelRayActor pid=163692) (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(ActorModelRayActor pid=163692) (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (act_fn): SiLU()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (input_layernorm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) (post_attention_layernorm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (norm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) dataset: Open-Orca/OpenOrca
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:19,311] [INFO] [comm.py:637:init_distributed] cdb=None [repeated 3x across cluster]
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:19,311] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [repeated 2x across cluster]
(pid=164245) [2024-02-17 06:33:17,109] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 4x across cluster]
(LLMRayActor pid=164245) INFO 02-17 06:33:19 llm_engine.py:79] Initializing an LLM engine with config: model='OpenLLMAI/Llama-2-7b-sft-model-ocra-500k', tokenizer='OpenLLMAI/Llama-2-7b-sft-model-ocra-500k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=42)
(LLMRayActor pid=164245) INFO 02-17 06:33:21 weight_utils.py:163] Using model weights format ['*.safetensors']
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s] [repeated 3x across cluster]
(pid=164245) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations [repeated 4x across cluster]
(pid=164245) warnings.warn( [repeated 4x across cluster]
(ActorModelRayActor pid=163692) dataset: Dahoas/full-hh-rlhf
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.55s/it] [repeated 6x across cluster]
(LLMRayActor pid=164245) INFO 02-17 06:33:23 llm_engine.py:337] # GPU blocks: 7406, # CPU blocks: 512
(ActorModelRayActor pid=163692) dataset: tasksource/oasst1_pairwise_rlhf_reward
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:24,309] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:24,309] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(RewardModelRayActor pid=164150) LLMForSequenceRegression(
(RewardModelRayActor pid=164150) (value_head): Linear(in_features=4096, out_features=1, bias=False)
(RewardModelRayActor pid=164150) reward normalization status: True
(RewardModelRayActor pid=164150) mean: tensor([0.5352], dtype=torch.bfloat16), std tensor([1.8750], dtype=torch.bfloat16)
(LLMRayActor pid=164245) INFO 02-17 06:33:24 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMRayActor pid=164245) INFO 02-17 06:33:24 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,359] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,360] [INFO] [logging.py:96:log_dist] [Rank 0] Creating BF16 optimizer
(ReferenceModelRayActor pid=164147) Actor(
(ReferenceModelRayActor pid=164147) (model): LlamaForCausalLM(
(RewardModelRayActor pid=164150) (model): LlamaModel( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (embed_tokens): Embedding(32000, 4096, padding_idx=2) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (layers): ModuleList( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (0-31): 32 x LlamaDecoderLayer( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (self_attn): LlamaFlashAttention2( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (q_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (k_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (v_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (o_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (rotary_emb): LlamaRotaryEmbedding() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) ) [repeated 13x across cluster]
(RewardModelRayActor pid=164150) (mlp): LlamaMLP( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (up_proj): Linear(in_features=4096, out_features=11008, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (down_proj): Linear(in_features=11008, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (act_fn): SiLU() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (input_layernorm): LlamaRMSNorm() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (post_attention_layernorm): LlamaRMSNorm() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (norm): LlamaRMSNorm() [repeated 2x across cluster]
(ReferenceModelRayActor pid=164147) (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,475] [INFO] [utils.py:791:see_memory_usage] begin bf16_optimizer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,476] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,476] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 67.32 GB, percent = 3.3%
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,595] [INFO] [utils.py:791:see_memory_usage] end bf16_optimizer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] activation_checkpointing_config {
(ReferenceModelRayActor pid=164147) "partition_activations": false,
(ReferenceModelRayActor pid=164147) "contiguous_memory_optimization": false,
(ReferenceModelRayActor pid=164147) "cpu_checkpointing": false,
(ReferenceModelRayActor pid=164147) "number_checkpoints": null,
(ReferenceModelRayActor pid=164147) "synchronize_checkpoint_boundary": false,
(ReferenceModelRayActor pid=164147) "profile": false
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] amp_enabled .................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] amp_params ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] autotuning_config ............ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "start_step": null,
(ReferenceModelRayActor pid=164147) "end_step": null,
(ReferenceModelRayActor pid=164147) "metric_path": null,
(ReferenceModelRayActor pid=164147) "arg_mappings": null,
(ReferenceModelRayActor pid=164147) "metric": "throughput",
(ReferenceModelRayActor pid=164147) "model_info": null,
(ReferenceModelRayActor pid=164147) "results_dir": "autotuning_results",
(ReferenceModelRayActor pid=164147) "exps_dir": "autotuning_exps",
(ReferenceModelRayActor pid=164147) "overwrite": true,
(ReferenceModelRayActor pid=164147) "fast": true,
(ReferenceModelRayActor pid=164147) "start_profile_step": 3,
(ReferenceModelRayActor pid=164147) "end_profile_step": 5,
(ReferenceModelRayActor pid=164147) "tuner_type": "gridsearch",
(ReferenceModelRayActor pid=164147) "tuner_early_stopping": 5,
(ReferenceModelRayActor pid=164147) "tuner_num_trials": 50,
(ReferenceModelRayActor pid=164147) "model_info_path": null,
(ReferenceModelRayActor pid=164147) "mp_size": 1,
(ReferenceModelRayActor pid=164147) "max_train_batch_size": null,
(ReferenceModelRayActor pid=164147) "min_train_batch_size": 1,
(ReferenceModelRayActor pid=164147) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(ReferenceModelRayActor pid=164147) "min_train_micro_batch_size_per_gpu": 1,
(ReferenceModelRayActor pid=164147) "num_tuning_micro_batch_sizes": 3
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa42f7a1810>
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] communication_data_type ...... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] disable_allgather ............ False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dump_state ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] elasticity_enabled ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] flops_profiler_config ........ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "recompute_fwd_factor": 0.0,
(ReferenceModelRayActor pid=164147) "profile_step": 1,
(ReferenceModelRayActor pid=164147) "module_depth": -1,
(ReferenceModelRayActor pid=164147) "top_modules": 1,
(ReferenceModelRayActor pid=164147) "detailed": true,
(ReferenceModelRayActor pid=164147) "output_file": null
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_enabled ................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] global_rank .................. 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] grad_accum_dtype ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_accumulation_steps .. 32
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] graph_harvesting ............. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] loss_scale ................... 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] memory_breakdown ............. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] mics_shard_size .............. -1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] nebula_config ................ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "persistent_storage_path": null,
(ReferenceModelRayActor pid=164147) "persistent_time_interval": 100,
(ReferenceModelRayActor pid=164147) "num_of_version_in_retention": 2,
(ReferenceModelRayActor pid=164147) "enable_nebula_load": true,
(ReferenceModelRayActor pid=164147) "load_path": null
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_name ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_params ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pld_enabled .................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pld_params ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] prescale_gradients ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] scheduler_name ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] scheduler_params ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] sparse_attention ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] steps_per_print .............. 100
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] train_batch_size ............. 128
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] use_node_local_storage ....... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] weight_quantization_config ... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] world_size ................... 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_enabled ................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_optimization_stage ...... 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,600] [INFO] [config.py:974:print_user_config] json = {
(ReferenceModelRayActor pid=164147) "steps_per_print": 100,
(ReferenceModelRayActor pid=164147) "zero_optimization": {
(ReferenceModelRayActor pid=164147) "stage": 0,
(ReferenceModelRayActor pid=164147) "stage3_param_persistence_threshold": "auto",
(ReferenceModelRayActor pid=164147) "offload_param": {
(ReferenceModelRayActor pid=164147) "device": "none",
(ReferenceModelRayActor pid=164147) "pin_memory": true
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) },
(ReferenceModelRayActor pid=164147) "bf16": {
(ReferenceModelRayActor pid=164147) "enabled": true
(ReferenceModelRayActor pid=164147) },
(ReferenceModelRayActor pid=164147) "gradient_clipping": 1.0,
(ReferenceModelRayActor pid=164147) "prescale_gradients": false,
(ReferenceModelRayActor pid=164147) "wall_clock_breakdown": false,
(ReferenceModelRayActor pid=164147) "train_micro_batch_size_per_gpu": 4,
(ReferenceModelRayActor pid=164147) "train_batch_size": 128
(ReferenceModelRayActor pid=164147) }
(ActorModelRayActor pid=163692) [Dataset({
(ActorModelRayActor pid=163692) features: ['id', 'system_prompt', 'question', 'response'],
(ActorModelRayActor pid=163692) num_rows: 80000
(ActorModelRayActor pid=163692) }), Dataset({
(ActorModelRayActor pid=163692) features: ['prompt', 'response', 'chosen', 'rejected'],
(ActorModelRayActor pid=163692) num_rows: 80000
(ActorModelRayActor pid=163692) }), Dataset({
(ActorModelRayActor pid=163692) features: ['lang', 'parent_id', 'prompt', 'chosen', 'rejected'],
(ActorModelRayActor pid=163692) num_rows: 17966
(ActorModelRayActor pid=163692) })]
(LLMRayActor pid=164245) INFO 02-17 06:33:30 model_runner.py:738] Graph capturing finished in 5 secs.
(RewardModelRayActor pid=164150) [2024-02-17 06:33:24,423] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(RewardModelRayActor pid=164150) [2024-02-17 06:33:24,424] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
0%| | 0/80000 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.47s/it] [repeated 3x across cluster]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.62s/it]
100%|██████████| 80000/80000 [00:08<00:00, 9377.14it/s]
(ActorModelRayActor pid=163888) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,355] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,357] [INFO] [logging.py:96:log_dist] [Rank 0] Creating BF16 optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,519] [INFO] [utils.py:791:see_memory_usage] begin bf16_optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,683] [INFO] [utils.py:792:see_memory_usage] MA 12.37 GB Max_MA 12.37 GB CA 12.37 GB Max_CA 12 GB [repeated 3x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,684] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 3.3% [repeated 3x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,682] [INFO] [utils.py:791:see_memory_usage] end bf16_optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,684] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] activation_checkpointing_config {
(RewardModelRayActor pid=164150) "partition_activations": false,
(RewardModelRayActor pid=164150) "contiguous_memory_optimization": false,
(RewardModelRayActor pid=164150) "cpu_checkpointing": false,
(RewardModelRayActor pid=164150) "number_checkpoints": null,
(RewardModelRayActor pid=164150) "synchronize_checkpoint_boundary": false,
(RewardModelRayActor pid=164150) "profile": false
(RewardModelRayActor pid=164150) } [repeated 6x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] amp_enabled .................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] amp_params ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] autotuning_config ............ {
(RewardModelRayActor pid=164150) "enabled": false, [repeated 3x across cluster]
(RewardModelRayActor pid=164150) "start_step": null,
(RewardModelRayActor pid=164150) "end_step": null,
(RewardModelRayActor pid=164150) "metric_path": null,
(RewardModelRayActor pid=164150) "arg_mappings": null,
(RewardModelRayActor pid=164150) "metric": "throughput",
(RewardModelRayActor pid=164150) "model_info": null,
(RewardModelRayActor pid=164150) "results_dir": "autotuning_results",
(RewardModelRayActor pid=164150) "exps_dir": "autotuning_exps",
(RewardModelRayActor pid=164150) "overwrite": true,
(RewardModelRayActor pid=164150) "fast": true,
(RewardModelRayActor pid=164150) "start_profile_step": 3,
(RewardModelRayActor pid=164150) "end_profile_step": 5,
(RewardModelRayActor pid=164150) "tuner_type": "gridsearch",
(RewardModelRayActor pid=164150) "tuner_early_stopping": 5,
(RewardModelRayActor pid=164150) "tuner_num_trials": 50,
(RewardModelRayActor pid=164150) "model_info_path": null,
(RewardModelRayActor pid=164150) "mp_size": 1,
(RewardModelRayActor pid=164150) "max_train_batch_size": null,
(RewardModelRayActor pid=164150) "min_train_batch_size": 1,
(RewardModelRayActor pid=164150) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(RewardModelRayActor pid=164150) "min_train_micro_batch_size_per_gpu": 1,
(RewardModelRayActor pid=164150) "num_tuning_micro_batch_sizes": 3
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f1634a72cb0>
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] communication_data_type ...... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] disable_allgather ............ False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dump_state ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] elasticity_enabled ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] flops_profiler_config ........ {
(RewardModelRayActor pid=164150) "recompute_fwd_factor": 0.0,
(RewardModelRayActor pid=164150) "profile_step": 1,
(RewardModelRayActor pid=164150) "module_depth": -1,
(RewardModelRayActor pid=164150) "top_modules": 1,
(RewardModelRayActor pid=164150) "detailed": true,
(RewardModelRayActor pid=164150) "output_file": null
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_enabled ................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] global_rank .................. 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] grad_accum_dtype ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_accumulation_steps .. 32
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] graph_harvesting ............. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] loss_scale ................... 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] memory_breakdown ............. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] mics_shard_size .............. -1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] nebula_config ................ {
(RewardModelRayActor pid=164150) "persistent_storage_path": null,
(RewardModelRayActor pid=164150) "persistent_time_interval": 100,
(RewardModelRayActor pid=164150) "num_of_version_in_retention": 2,
(RewardModelRayActor pid=164150) "enable_nebula_load": true,
(RewardModelRayActor pid=164150) "load_path": null
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_name ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_params ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pld_enabled .................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pld_params ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] prescale_gradients ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] scheduler_name ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] scheduler_params ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] sparse_attention ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] steps_per_print .............. 100
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] train_batch_size ............. 128
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] use_node_local_storage ....... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] weight_quantization_config ... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] world_size ................... 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_enabled ................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_optimization_stage ...... 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:974:print_user_config] json = {
(RewardModelRayActor pid=164150) "steps_per_print": 100,
(RewardModelRayActor pid=164150) "zero_optimization": {
(RewardModelRayActor pid=164150) "stage": 0,
(RewardModelRayActor pid=164150) "stage3_param_persistence_threshold": "auto",
(RewardModelRayActor pid=164150) "offload_param": {
(RewardModelRayActor pid=164150) "device": "none",
(RewardModelRayActor pid=164150) "pin_memory": true
(RewardModelRayActor pid=164150) }, [repeated 2x across cluster]
(RewardModelRayActor pid=164150) "bf16": {
(RewardModelRayActor pid=164150) "enabled": true
(RewardModelRayActor pid=164150) "gradient_clipping": 1.0,
(RewardModelRayActor pid=164150) "prescale_gradients": false,
(RewardModelRayActor pid=164150) "wall_clock_breakdown": false,
(RewardModelRayActor pid=164150) "train_micro_batch_size_per_gpu": 4,
(RewardModelRayActor pid=164150) "train_batch_size": 128
(ActorModelRayActor pid=163888) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(ActorModelRayActor pid=163888) Detected CUDA files, patching ldflags
(ActorModelRayActor pid=163888) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(ActorModelRayActor pid=163888) Building extension module cpu_adam...
(ActorModelRayActor pid=163888) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(ActorModelRayActor pid=163888) ninja: no work to do.
(ActorModelRayActor pid=163888) Time to load cpu_adam op: 2.4911670684814453 seconds
(ActorModelRayActor pid=163888) Loading extension module cpu_adam...
(ActorModelRayActor pid=163692) [2024-02-17 06:33:47,732] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(ActorModelRayActor pid=163692) [2024-02-17 06:33:47,732] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,346] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,347] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,347] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 500,000,000
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 500,000,000
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: True
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
(ActorModelRayActor pid=163692) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(ActorModelRayActor pid=163692) ninja: no work to do.
(ActorModelRayActor pid=163692) Time to load cpu_adam op: 2.5111589431762695 seconds
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,969] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,970] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,970] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 73.06 GB, percent = 3.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,788] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 153.58 GB, percent = 7.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,916] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,916] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,917] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 153.59 GB, percent = 7.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f9759c103d0>
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,923] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,923] [INFO] [config.py:988:print] activation_checkpointing_config {
(ActorModelRayActor pid=163692) "partition_activations": false,
(ActorModelRayActor pid=163692) "contiguous_memory_optimization": false,
(ActorModelRayActor pid=163692) "cpu_checkpointing": false,
(ActorModelRayActor pid=163692) "number_checkpoints": null,
(ActorModelRayActor pid=163692) "synchronize_checkpoint_boundary": false,
(ActorModelRayActor pid=163692) "profile": false
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] amp_enabled .................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] amp_params ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] autotuning_config ............ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "start_step": null,
(ActorModelRayActor pid=163692) "end_step": null,
(ActorModelRayActor pid=163692) "metric_path": null,
(ActorModelRayActor pid=163692) "arg_mappings": null,
(ActorModelRayActor pid=163692) "metric": "throughput",
(ActorModelRayActor pid=163692) "model_info": null,
(ActorModelRayActor pid=163692) "results_dir": "autotuning_results",
(ActorModelRayActor pid=163692) "exps_dir": "autotuning_exps",
(ActorModelRayActor pid=163692) "overwrite": true,
(ActorModelRayActor pid=163692) "fast": true,
(ActorModelRayActor pid=163692) "start_profile_step": 3,
(ActorModelRayActor pid=163692) "end_profile_step": 5,
(ActorModelRayActor pid=163692) "tuner_type": "gridsearch",
(ActorModelRayActor pid=163692) "tuner_early_stopping": 5,
(ActorModelRayActor pid=163692) "tuner_num_trials": 50,
(ActorModelRayActor pid=163692) "model_info_path": null,
(ActorModelRayActor pid=163692) "mp_size": 1,
(ActorModelRayActor pid=163692) "max_train_batch_size": null,
(ActorModelRayActor pid=163692) "min_train_batch_size": 1,
(ActorModelRayActor pid=163692) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(ActorModelRayActor pid=163692) "min_train_micro_batch_size_per_gpu": 1,
(ActorModelRayActor pid=163692) "num_tuning_micro_batch_sizes": 3
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f97301aa3b0>
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] communication_data_type ...... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] disable_allgather ............ False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dump_state ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] elasticity_enabled ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] flops_profiler_config ........ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "recompute_fwd_factor": 0.0,
(ActorModelRayActor pid=163692) "profile_step": 1,
(ActorModelRayActor pid=163692) "module_depth": -1,
(ActorModelRayActor pid=163692) "top_modules": 1,
(ActorModelRayActor pid=163692) "detailed": true,
(ActorModelRayActor pid=163692) "output_file": null
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_enabled ................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] global_rank .................. 0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] grad_accum_dtype ............. fp32
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_accumulation_steps .. 16
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] graph_harvesting ............. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] loss_scale ................... 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] memory_breakdown ............. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] mics_shard_size .............. -1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] nebula_config ................ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "persistent_storage_path": null,
(ActorModelRayActor pid=163692) "persistent_time_interval": 100,
(ActorModelRayActor pid=163692) "num_of_version_in_retention": 2,
(ActorModelRayActor pid=163692) "enable_nebula_load": true,
(ActorModelRayActor pid=163692) "load_path": null
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_name ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_params ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pld_enabled .................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pld_params ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] prescale_gradients ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] scheduler_name ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] scheduler_params ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] sparse_attention ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] steps_per_print .............. 100
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] train_batch_size ............. 128
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] use_node_local_storage ....... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] weight_quantization_config ... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] world_size ................... 2
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_enabled ................. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:974:print_user_config] json = {
(ActorModelRayActor pid=163692) "steps_per_print": 100,
(ActorModelRayActor pid=163692) "zero_optimization": {
(ActorModelRayActor pid=163692) "stage": 2,
(ActorModelRayActor pid=163692) "offload_param": {
(ActorModelRayActor pid=163692) "device": "none"
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "offload_optimizer": {
(ActorModelRayActor pid=163692) "device": "cpu",
(ActorModelRayActor pid=163692) "pin_memory": true
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "sub_group_size": "auto",
(ActorModelRayActor pid=163692) "stage3_max_live_parameters": "auto",
(ActorModelRayActor pid=163692) "stage3_max_reuse_distance": "auto",
(ActorModelRayActor pid=163692) "stage3_param_persistence_threshold": "auto",
(ActorModelRayActor pid=163692) "stage3_prefetch_bucket_size": "auto",
(ActorModelRayActor pid=163692) "reduce_bucket_size": "auto",
(ActorModelRayActor pid=163692) "zero_hpz_partition_size": 1,
(ActorModelRayActor pid=163692) "zero_quantized_weights": false,
(ActorModelRayActor pid=163692) "zero_quantized_gradients": false
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "bf16": {
(ActorModelRayActor pid=163692) "enabled": true
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "gradient_clipping": 1.0,
(ActorModelRayActor pid=163692) "prescale_gradients": false,
(ActorModelRayActor pid=163692) "wall_clock_breakdown": false,
(ActorModelRayActor pid=163692) "data_types": {
(ActorModelRayActor pid=163692) "grad_accum_dtype": "fp32"
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "train_micro_batch_size_per_gpu": 4,
(ActorModelRayActor pid=163692) "train_batch_size": 128
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) ***** ppo_actor 207 actor model prepared
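Note: the user config JSON printed above corresponds to an ordinary DeepSpeed config dict. A minimal sketch of the equivalent setup, assuming a direct deepspeed.initialize call rather than OpenRLHF's internal wrapper (model and optimizer here are placeholders, not the repo's actual objects):

import deepspeed

ds_config = {
    "steps_per_print": 100,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "data_types": {"grad_accum_dtype": "fp32"},
    "train_micro_batch_size_per_gpu": 4,
    # 128 = 4 micro-batch x 16 gradient-accumulation steps x 2 data-parallel ranks,
    # matching gradient_accumulation_steps=16 and world_size=2 in the log above
    "train_batch_size": 128,
}

# engine, optimizer, _, scheduler = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config
# )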
(CriticModelRayActor pid=163889) [2024-02-17 06:34:31,930] [INFO] [comm.py:637:init_distributed] cdb=None
(CriticModelRayActor pid=163889) [2024-02-17 06:34:31,930] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(CriticModelRayActor pid=163889) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(ActorModelRayActor pid=163692) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(ActorModelRayActor pid=163692) Detected CUDA files, patching ldflags
(ActorModelRayActor pid=163692) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(ActorModelRayActor pid=163692) Building extension module cpu_adam...
(ActorModelRayActor pid=163692) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(ActorModelRayActor pid=163692) Loading extension module cpu_adam...
(CriticModelRayActor pid=163889) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(CriticModelRayActor pid=163889) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:01<00:03, 1.58s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.42s/it]
(CriticModelRayActor pid=164146) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(CriticModelRayActor pid=163889) LLMForSequenceRegression(
(CriticModelRayActor pid=163889) (model): LlamaModel(
(CriticModelRayActor pid=163889) (embed_tokens): Embedding(32000, 4096, padding_idx=2)
(CriticModelRayActor pid=163889) (layers): ModuleList(
(CriticModelRayActor pid=163889) (0-31): 32 x LlamaDecoderLayer(
(CriticModelRayActor pid=163889) (self_attn): LlamaFlashAttention2(
(CriticModelRayActor pid=163889) (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (rotary_emb): LlamaRotaryEmbedding()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (mlp): LlamaMLP(
(CriticModelRayActor pid=163889) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(CriticModelRayActor pid=163889) (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(CriticModelRayActor pid=163889) (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (act_fn): SiLU()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (input_layernorm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) (post_attention_layernorm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (norm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (value_head): Linear(in_features=4096, out_features=1, bias=False)
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) reward normalization status: True
(CriticModelRayActor pid=163889) mean: tensor([0.5352], dtype=torch.bfloat16), std tensor([1.8750], dtype=torch.bfloat16)
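Note: with reward normalization enabled, raw reward-model outputs are standardized using the running statistics printed above. A one-line sketch of the usual transform (variable names are illustrative, not OpenRLHF's exact code):

# mean ~ 0.54 and std ~ 1.88 are the values from the log above
normalized_reward = (raw_reward - mean) / (std + 1e-8)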
(ActorModelRayActor pid=163888) ***** ppo_actor 207 actor model prepared
(CriticModelRayActor pid=164146) [2024-02-17 06:34:31,931] [INFO] [comm.py:637:init_distributed] cdb=None
(CriticModelRayActor pid=164146) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052 [repeated 2x across cluster]
(CriticModelRayActor pid=164146) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(CriticModelRayActor pid=164146) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.69s/it] [repeated 3x across cluster]
(CriticModelRayActor pid=164146) ninja: no work to do.
(CriticModelRayActor pid=164146) Time to load cpu_adam op: 2.4911234378814697 seconds
(CriticModelRayActor pid=164146) Detected CUDA files, patching ldflags
(CriticModelRayActor pid=164146) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(CriticModelRayActor pid=164146) Building extension module cpu_adam...
(CriticModelRayActor pid=164146) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(CriticModelRayActor pid=164146) Loading extension module cpu_adam...
(CriticModelRayActor pid=163889) [2024-02-17 06:34:41,505] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(CriticModelRayActor pid=163889) [2024-02-17 06:34:41,506] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,857] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,858] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,858] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,871] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,871] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 500,000,000
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 500,000,000
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: True
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,983] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,984] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,984] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 183.9 GB, percent = 9.1%
(CriticModelRayActor pid=163889) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(CriticModelRayActor pid=163889) ninja: no work to do.
(CriticModelRayActor pid=163889) Time to load cpu_adam op: 2.4824106693267822 seconds
(CriticModelRayActor pid=164146) ***** Critic model is ready
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,812] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 263.63 GB, percent = 13.1%
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
***** async init model done
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,939] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,940] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,940] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 263.63 GB, percent = 13.1%
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,945] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,945] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f873f0fd0c0>
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] activation_checkpointing_config {
(CriticModelRayActor pid=163889) "partition_activations": false,
(CriticModelRayActor pid=163889) "contiguous_memory_optimization": false,
(CriticModelRayActor pid=163889) "cpu_checkpointing": false,
(CriticModelRayActor pid=163889) "number_checkpoints": null,
(CriticModelRayActor pid=163889) "synchronize_checkpoint_boundary": false,
(CriticModelRayActor pid=163889) "profile": false
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] amp_enabled .................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] amp_params ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] autotuning_config ............ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "start_step": null,
(CriticModelRayActor pid=163889) "end_step": null,
(CriticModelRayActor pid=163889) "metric_path": null,
(CriticModelRayActor pid=163889) "arg_mappings": null,
(CriticModelRayActor pid=163889) "metric": "throughput",
(CriticModelRayActor pid=163889) "model_info": null,
(CriticModelRayActor pid=163889) "results_dir": "autotuning_results",
(CriticModelRayActor pid=163889) "exps_dir": "autotuning_exps",
(CriticModelRayActor pid=163889) "overwrite": true,
(CriticModelRayActor pid=163889) "fast": true,
(CriticModelRayActor pid=163889) "start_profile_step": 3,
(CriticModelRayActor pid=163889) "end_profile_step": 5,
(CriticModelRayActor pid=163889) "tuner_type": "gridsearch",
(CriticModelRayActor pid=163889) "tuner_early_stopping": 5,
(CriticModelRayActor pid=163889) "tuner_num_trials": 50,
(CriticModelRayActor pid=163889) "model_info_path": null,
(CriticModelRayActor pid=163889) "mp_size": 1,
(CriticModelRayActor pid=163889) "max_train_batch_size": null,
(CriticModelRayActor pid=163889) "min_train_batch_size": 1,
(CriticModelRayActor pid=163889) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(CriticModelRayActor pid=163889) "min_train_micro_batch_size_per_gpu": 1,
(CriticModelRayActor pid=163889) "num_tuning_micro_batch_sizes": 3
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f873c56dd20>
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] communication_data_type ...... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] disable_allgather ............ False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] dump_state ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] elasticity_enabled ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] flops_profiler_config ........ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "recompute_fwd_factor": 0.0,
(CriticModelRayActor pid=163889) "profile_step": 1,
(CriticModelRayActor pid=163889) "module_depth": -1,
(CriticModelRayActor pid=163889) "top_modules": 1,
(CriticModelRayActor pid=163889) "detailed": true,
(CriticModelRayActor pid=163889) "output_file": null
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_enabled ................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] global_rank .................. 0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] grad_accum_dtype ............. fp32
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_accumulation_steps .. 16
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] graph_harvesting ............. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] loss_scale ................... 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] memory_breakdown ............. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] mics_shard_size .............. -1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] nebula_config ................ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "persistent_storage_path": null,
(CriticModelRayActor pid=163889) "persistent_time_interval": 100,
(CriticModelRayActor pid=163889) "num_of_version_in_retention": 2,
(CriticModelRayActor pid=163889) "enable_nebula_load": true,
(CriticModelRayActor pid=163889) "load_path": null
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] optimizer_name ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] optimizer_params ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pld_enabled .................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pld_params ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] prescale_gradients ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] scheduler_name ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] scheduler_params ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] sparse_attention ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] steps_per_print .............. 100
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] train_batch_size ............. 128
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] use_node_local_storage ....... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] weight_quantization_config ... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] world_size ................... 2
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_enabled ................. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:974:print_user_config] json = {
(CriticModelRayActor pid=163889) "steps_per_print": 100,
(CriticModelRayActor pid=163889) "zero_optimization": {
(CriticModelRayActor pid=163889) "stage": 2,
(CriticModelRayActor pid=163889) "offload_param": {
(CriticModelRayActor pid=163889) "device": "none"
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "offload_optimizer": {
(CriticModelRayActor pid=163889) "device": "cpu",
(CriticModelRayActor pid=163889) "pin_memory": true
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "sub_group_size": "auto",
(CriticModelRayActor pid=163889) "stage3_max_live_parameters": "auto",
(CriticModelRayActor pid=163889) "stage3_max_reuse_distance": "auto",
(CriticModelRayActor pid=163889) "stage3_param_persistence_threshold": "auto",
(CriticModelRayActor pid=163889) "stage3_prefetch_bucket_size": "auto",
(CriticModelRayActor pid=163889) "reduce_bucket_size": "auto",
(CriticModelRayActor pid=163889) "zero_hpz_partition_size": 1,
(CriticModelRayActor pid=163889) "zero_quantized_weights": false,
(CriticModelRayActor pid=163889) "zero_quantized_gradients": false
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "bf16": {
(CriticModelRayActor pid=163889) "enabled": true
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "gradient_clipping": 1.0,
(CriticModelRayActor pid=163889) "prescale_gradients": false,
(CriticModelRayActor pid=163889) "wall_clock_breakdown": false,
(CriticModelRayActor pid=163889) "data_types": {
(CriticModelRayActor pid=163889) "grad_accum_dtype": "fp32"
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "train_micro_batch_size_per_gpu": 4,
(CriticModelRayActor pid=163889) "train_batch_size": 128
(CriticModelRayActor pid=163889) }
(ActorModelRayActor pid=163692) wandb: Currently logged in as: tianhaowu. Use `wandb login --relogin` to force relogin
(ActorModelRayActor pid=163692) wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.52s/it]
(CriticModelRayActor pid=163889) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(CriticModelRayActor pid=163889) Detected CUDA files, patching ldflags
(CriticModelRayActor pid=163889) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(CriticModelRayActor pid=163889) Building extension module cpu_adam...
(CriticModelRayActor pid=163889) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(CriticModelRayActor pid=163889) Loading extension module cpu_adam...
(ActorModelRayActor pid=163692) wandb: Tracking run with wandb version 0.16.3
(ActorModelRayActor pid=163692) wandb: Run data is saved locally in /tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/wandb/run-20240217_063522-3w87zqyo
(ActorModelRayActor pid=163692) wandb: Run `wandb offline` to turn off syncing.
(ActorModelRayActor pid=163692) wandb: Syncing run ppo_0217T06:33
(ActorModelRayActor pid=163692) wandb: ⭐️ View project at https://wandb.ai/tianhaowu/openrlhf_train_ppo
(ActorModelRayActor pid=163692) wandb: 🚀 View run at https://wandb.ai/tianhaowu/openrlhf_train_ppo/runs/3w87zqyo
(ActorModelRayActor pid=163888) Adam Optimizer #0 is created with AVX2 arithmetic capability.
(ActorModelRayActor pid=163888) Config: alpha=0.000000, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
(CriticModelRayActor pid=163889) ***** Critic model is ready
(ActorModelRayActor pid=163888) [rank1]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/examples/train_ppo_ray.py", line 291, in <module>
train(args)
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/examples/train_ppo_ray.py", line 162, in train
ray.get(refs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::ActorModelRayActor.fit() (pid=163888, ip=0.0.0.0, actor_id=00f790e87ebcba2952fe737b02000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActor object at 0x7f5f703cd390>)
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/openrlhf/trainer/ray/ppo_actor.py", line 282, in fit
trainer = ActorPPOTrainer(
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/openrlhf/trainer/ray/ppo_actor.py", line 96, in __init__
torch.distributed.barrier()
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:550 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f63740f4d87 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15c0e0b (0x7f604e191e0b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6052460b32 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6052461961 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f601b654c69 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x22b (0x7f601b65bc5b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x10ad03d (0x7f601b66503d in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f601b6668e1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3bf (0x7f601b6688ff in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb0e (0x7f601b677d4e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x5838872 (0x7f6052409872 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5843590 (0x7f6052414590 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x5843695 (0x7f6052414695 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x4e8937c (0x7f6051a5a37c in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x1a08a38 (0x7f604e5d9a38 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x584cca4 (0x7f605241dca4 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x584da55 (0x7f605241ea55 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0xc93e88 (0x7f62b6247e88 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #21: <unknown function> + 0x413ef4 (0x7f62b59c7ef4 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0x172df4 (0x556163100df4 in ray::ActorModelRayActor.fit)
frame #23: _PyObject_MakeTpCall + 0x1f8 (0x5561630c7db8 in ray::ActorModelRayActor.fit)
frame #24: <unknown function> + 0xeb5a7 (0x5561630795a7 in ray::ActorModelRayActor.fit)
frame #25: <unknown function> + 0x105bbf (0x556163093bbf in ray::ActorModelRayActor.fit)
frame #26: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #27: _PyObject_Call + 0x1f6 (0x5561630ce3f6 in ray::ActorModelRayActor.fit)
frame #28: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #29: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #30: <unknown function> + 0x10669e (0x55616309469e in ray::ActorModelRayActor.fit)
frame #31: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #32: _PyObject_FastCallDictTstate + 0x162 (0x556163115a92 in ray::ActorModelRayActor.fit)
frame #33: <unknown function> + 0x191f53 (0x55616311ff53 in ray::ActorModelRayActor.fit)
frame #34: <unknown function> + 0x153a21 (0x5561630e1a21 in ray::ActorModelRayActor.fit)
frame #35: _PyObject_Call + 0x259 (0x5561630ce459 in ray::ActorModelRayActor.fit)
frame #36: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #37: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #38: _PyObject_Call + 0xf7 (0x5561630ce2f7 in ray::ActorModelRayActor.fit)
frame #39: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #40: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #41: _PyObject_Call + 0xf7 (0x5561630ce2f7 in ray::ActorModelRayActor.fit)
frame #42: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #43: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #44: PyVectorcall_Call + 0x9c (0x556163023c4c in ray::ActorModelRayActor.fit)
frame #45: <unknown function> + 0x5ade2f (0x7f6377ac5e2f in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #46: <unknown function> + 0x5ef9b8 (0x7f6377b079b8 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #47: <unknown function> + 0x5ade2f (0x7f6377ac5e2f in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #48: <unknown function> + 0x670b3e (0x7f6377b88b3e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #49: std::_Function_handler<ray::Status (ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool, long), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool, long)>::_M_invoke(std::_Any_data const&, ray::rpc::Address const&, ray::rpc::TaskType&&, std::string&&, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*&&, std::string*&&, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string&&, bool&&, bool&&, bool&&, long&&) + 0x169 (0x7f6377acb509 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #50: ray::core::CoreWorker::ExecuteTask(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*) + 0xc5c (0x7f6377ca918c in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #51: std::_Function_handler<ray::Status (ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*), std::_Bind<ray::Status (ray::core::CoreWorker::*(ray::core::CoreWorker*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>, std::_Placeholder<5>, std::_Placeholder<6>, std::_Placeholder<7>, std::_Placeholder<8>))(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*)> >::_M_invoke(std::_Any_data const&, ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*&&, bool*&&, std::string*&&) + 0x58 (0x7f6377be0f98 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #52: <unknown function> + 0x7b7664 (0x7f6377ccf664 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #53: <unknown function> + 0x7b889a (0x7f6377cd089a in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #54: <unknown function> + 0x7cfe1e (0x7f6377ce7e1e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #55: ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled(ray::TaskID, ray::core::InboundRequest&) + 0x114 (0x7f6377ce8e34 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #56: <unknown function> + 0x7d3a5b (0x7f6377ceba5b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #57: ray::core::ActorSchedulingQueue::Add(long, long, std::function<void (std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status const&, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>, std::string const&, std::shared_ptr<ray::FunctionDescriptorInterface> const&, ray::TaskID, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&) + 0x400 (0x7f6377ced570 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #58: ray::core::CoreWorkerDirectTaskReceiver::HandleTask(ray::rpc::PushTaskRequest const&, ray::rpc::PushTaskReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) + 0x119c (0x7f6377ccefcc in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #59: <unknown function> + 0x75b6f5 (0x7f6377c736f5 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #60: <unknown function> + 0xa2864e (0x7f6377f4064e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #61: <unknown function> + 0xa21a3e (0x7f6377f39a3e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #62: <unknown function> + 0xa21eb6 (0x7f6377f39eb6 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #63: <unknown function> + 0x10d550b (0x7f63785ed50b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
. This may indicate a possible application crash on rank 0 or a network set up issue.
also could you share the exact version of libraries by using
pip list
in your environment? Thank you so much for the quick response :) hope we can build something cool together
Package Version
-------------------------- ------------
accelerate 0.27.2
aiohttp 3.9.1
aiohttp-cors 0.7.0
aioprometheus 23.3.0
aiorwlock 1.3.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 3.7.1
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
bitsandbytes 0.42.0
blessed 1.20.0
boltons 23.0.0
brotlipy 0.7.0
bytedance-context 0.7.1
bytedance.metrics 0.4.0
bytedance.servicediscovery 0.1.2
bytedbackgrounds 0.0.6
bytedenv 0.6.2
bytedray 2.6.1
bytedservicediscovery 0.17.4
cachetools 5.3.2
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
colorful 0.5.5
conda 23.5.2
conda-content-trust 0.1.3
conda-libmamba-solver 23.5.0
conda-package-handling 2.1.0
conda_package_streaming 0.8.0
crypto 1.4.1
cryptography 39.0.1
cupy-cuda11x 12.3.0
datasets 2.15.0
deepspeed 0.12.5
dill 0.3.7
distlib 0.3.8
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.98.0
fastrlock 0.8.2
filelock 3.9.0
flash-attn 2.3.6
frozenlist 1.3.3
fsspec 2023.10.0
gitdb 4.0.11
GitPython 3.1.40
google-api-core 2.15.0
google-auth 2.25.2
googleapis-common-protos 1.62.0
gpustat 1.0.0
grpcio 1.59.3
h11 0.14.0
hjson 3.1.0
httptools 0.6.1
huggingface-hub 0.20.1
humanfriendly 10.0
idna 3.4
ipaddress 1.0.23
isort 5.13.2
Jinja2 3.1.2
jsonlines 4.0.0
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.17.3
jsonschema-specifications 2023.11.2
libmambapy 1.4.1
lightning-utilities 0.10.1
loralib 0.1.2
MarkupSafe 2.1.3
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multiprocess 0.70.15
Naked 0.1.32
networkx 3.0
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 11.495.46
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
opencensus 0.11.3
opencensus-context 0.1.3
optimum 1.17.1
orjson 3.9.10
packaging 23.0
pandas 2.1.4
peft 0.8.2
Pillow 9.3.0
pip 23.1.2
platformdirs 3.11.0
pluggy 1.0.0
prometheus-client 0.13.1
protobuf 3.20.3
psutil 5.9.7
py-cpuinfo 9.0.0
py-spy 0.3.14
pyarrow 14.0.2
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycosat 0.6.4
pycparser 2.21
pycryptodome 3.18.0
pydantic 1.10.13
pydantic_core 2.14.5
pynvml 11.5.0
pyOpenSSL 23.0.0
pyrsistent 0.20.0
PySocks 1.7.1
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2023.3.post1
PyYAML 6.0.1
quantile-python 1.1
referencing 0.32.0
regex 2023.10.3
requests 2.29.0
rpds-py 0.15.2
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
safetensors 0.4.1
schedule 1.2.1
scipy 1.12.0
sentencepiece 0.1.99
sentry-sdk 1.39.1
setproctitle 1.3.3
setuptools 67.8.0
shellescape 3.8.1
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tensorboardX 2.6.2.2
tokenizers 0.15.0
toolz 0.12.0
torch 2.1.1+cu118
torchaudio 2.1.1+cu118
torchmetrics 1.3.1
torchvision 0.16.1+cu118
tqdm 4.65.0
transformers 4.37.1
triton 2.1.0
typing_extensions 4.9.0
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.21.1
uvloop 0.19.0
virtualenv 20.21.0
vllm 0.2.3+cu118
wandb 0.16.1
watchfiles 0.21.0
wcwidth 0.2.12
websockets 12.0
wheel 0.38.4
xformers 0.0.23+cu118
xxhash 3.4.1
yarl 1.9.4
zstandard 0.19.0
Thx for the information!!! Here is my pip list:
vllm 0.3.0+cu123 /workspace/vllm-fork
torch 2.2.0
Can it be related to the vllm version?
@tianhao-nexusflow I don't think it's related to vllm. Is your cuda version 12.3?
Package                       Version     Editable project location
----------------------------- ----------- -------------------------
accelerate                    0.27.2
aiohttp                       3.9.3
aiohttp-cors                  0.7.0
aioprometheus                 23.12.0
aiosignal                     1.3.1
annotated-types               0.6.0
anyio                         4.2.0
appdirs                       1.4.4
async-timeout                 4.0.3
attrs                         23.2.0
bitsandbytes                  0.42.0
blessed                       1.20.0
cachetools                    5.3.2
certifi                       2024.2.2
charset-normalizer            3.3.2
click                         8.1.7
coloredlogs                   15.0.1
colorful                      0.5.6
cupy-cuda12x                  12.1.0
datasets                      2.17.0
deepspeed                     0.13.1
dill                          0.3.8
distlib                       0.3.8
docker-pycreds                0.4.0
einops                        0.7.0
exceptiongroup                1.2.0
fastapi                       0.109.2
fastrlock                     0.8.2
filelock                      3.13.1
flash-attn                    2.5.3
frozenlist                    1.4.1
fsspec                        2023.10.0
gitdb                         4.0.11
GitPython                     3.1.42
google-api-core               2.17.1
google-auth                   2.28.0
googleapis-common-protos      1.62.0
gpustat                       1.1.1
grpcio                        1.60.1
h11                           0.14.0
hjson                         3.1.0
httptools                     0.6.1
huggingface-hub               0.20.3
humanfriendly                 10.0
idna                          3.6
isort                         5.13.2
Jinja2                        3.1.3
jsonlines                     4.0.0
jsonschema                    4.21.1
jsonschema-specifications     2023.12.1
lightning-utilities           0.10.1
loralib                       0.1.2
MarkupSafe                    2.1.5
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.5
multiprocess                  0.70.16
networkx                      3.2.1
ninja                         1.11.1.1
numpy                         1.26.4
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  12.535.133
nvidia-nccl-cu12              2.19.3
nvidia-nvjitlink-cu12         12.3.101
nvidia-nvtx-cu12              12.1.105
opencensus                    0.11.4
opencensus-context            0.1.3
openrlhf                      0.1.9       /workspace/OpenRLHF
optimum                       1.16.2
orjson                        3.9.14
packaging                     23.2
pandas                        2.2.0
peft                          0.8.2
pip                           23.3.1
platformdirs                  4.2.0
prometheus_client             0.20.0
protobuf                      4.25.3
psutil                        5.9.8
py-cpuinfo                    9.0.0
py-spy                        0.3.14
pyarrow                       15.0.0
pyarrow-hotfix                0.6
pyasn1                        0.5.1
pyasn1-modules                0.3.0
pydantic                      2.6.1
pydantic_core                 2.16.2
pynvml                        11.5.0
python-dateutil               2.8.2
python-dotenv                 1.0.1
pytz                          2024.1
PyYAML                        6.0.1
quantile-python               1.1
ray                           2.9.2
referencing                   0.33.0
regex                         2023.12.25
requests                      2.31.0
rpds-py                       0.18.0
rsa                           4.9
safetensors                   0.4.2
scipy                         1.12.0
sentencepiece                 0.1.99
sentry-sdk                    1.40.4
setproctitle                  1.3.3
setuptools                    68.2.2
six                           1.16.0
smart-open                    6.4.0
smmap                         5.0.1
sniffio                       1.3.0
starlette                     0.36.3
sympy                         1.12
tokenizers                    0.15.2
torch                         2.2.0
torchmetrics                  1.3.1
tqdm                          4.66.2
transformers                  4.37.1
transformers-stream-generator 0.0.4
triton                        2.2.0
typing_extensions             4.9.0
tzdata                        2024.1
urllib3                       2.2.0
uvicorn                       0.27.1
uvloop                        0.19.0
virtualenv                    20.25.0
vllm                          0.3.0+cu123 /workspace/vllm-fork
wandb                         0.16.3
watchfiles                    0.21.0
wcwidth                       0.2.13
websockets                    12.0
wheel                         0.41.2
xformers                      0.0.24
xxhash                        3.4.1
yarl                          1.9.4
Given the pip list, I think the CUDA version is 12.1?
@tianhao-nexusflow Can you post your run command and hardware info?
Sure! The run command is:
set -x
export PATH=$HOME/.local/bin/:$PATH
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/workspace/OpenRLHF"}' \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 2 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 2 \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1 \
--pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
--reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
--save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 8 \
--rollout_batch_size 1024 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
--init_kl_coef 0.01 \
--prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
--prompt_data_probs 0.4,0.5,0.1 \
--max_samples 80000 \
--normalize_reward \
--actor_init_on_gpu \
--adam_offload \
--flash_attn \
--gradient_checkpointing
For the hardware:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 26C P0 65W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0F:00.0 Off | 0 |
| N/A 25C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:47:00.0 Off | 0 |
| N/A 27C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4E:00.0 Off | 0 |
| N/A 27C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:87:00.0 Off | 0 |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:90:00.0 Off | 0 |
| N/A 31C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:B7:00.0 Off | 0 |
| N/A 30C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:BD:00.0 Off | 0 |
| N/A 31C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
@tianhao-nexusflow I can't reproduce with your script either, let me switch to cuda 12 and torch 2.2.
@wuxibin89 I found that there are some compatibility issues between vLLM and the NVIDIA PyTorch 23.12 docker image. It seems that going forward we should provide a dedicated image for users.
@tianhao-nexusflow Do you build vllm from source with torch==2.2? I found that vllm==0.3.0 is built with torch==2.1.2.
This seems to be a problem caused by the container environment. @wuxibin89 can run vLLM fine with his container image, but it hangs with the NVIDIA PyTorch 23.12 image. @karthik19967829
Great, thanks for the inputs, team @hijkzzz. If you can share a docker image that works, that would be great.
If you are able to get it working on top of the HF container or something similar, that should make the setup seamless.
OK! pip install vllm==0.2.4 is all you need!! OpenRLHF will be compatible with vllm==0.3.1 as soon as possible. @karthik19967829
Great, thanks @hijkzzz.
We are testing it out and will ping you once we observe the logs.
@hijkzzz Thanks for your insightful comment! I've followed the updated README; however, when I try
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8,
I get
bash: ray: command not found
However, when I run pip list, everything is installed. I guess it's related to PYTHONPATH but I haven't figured it out yet.
./build_openrlhf.sh ~/.local/bin/ray
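(From the paths above, the ray entry point installed by build_openrlhf.sh appears to land under ~/.local/bin rather than on the default PATH, so adding export PATH=$HOME/.local/bin/:$PATH, as in the run script earlier in this thread, should make the command resolvable. This is inferred from the script paths shown here, not a confirmed fix.)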
@karthik19967829 @tianhao-nexusflow This problem is related to the vllm version: we apply some monkey patches to vllm for weight synchronization between vllm and the actor model.
In vllm==0.2.7, they made a major change to their architecture, which breaks our monkey patch. As a quick fix, you can downgrade vllm to <=0.2.6. I will fix this very soon: https://github.com/vllm-project/vllm/pull/2221
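For readers following along, here is a minimal sketch of what such a monkey patch generally looks like. This is an illustration, not OpenRLHF's actual code; the function name update_weight and the patched attribute path are hypothetical:

import torch
import torch.distributed

def update_weight(self, name, dtype, shape, src_rank=0):
    # Receive one parameter tensor broadcast from the training process group,
    # then copy it into the inference engine's replica of the model.
    # Assumes a torch.distributed process group shared with the trainer.
    weight = torch.empty(shape, dtype=dtype, device="cuda")
    torch.distributed.broadcast(weight, src=src_rank)
    param = dict(self.model.named_parameters())[name]
    param.data.copy_(weight)
    del weight

# The patch replaces a method on a vLLM internal class, e.g. (illustrative only):
# vllm.worker.worker.Worker.update_weight = update_weight

Because the patch reaches into vLLM internals like this, any upstream refactor (such as the one in 0.2.7) can silently break it, which is why pinning the vllm version matters here.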
Great, thanks @wuxibin89! After downgrading, ZeRO-2 + vLLM is running, but the throughput is not very different from vanilla; we need to benchmark more closely. What's the general practice for speeding it up?
@karthik19967829 For PPO RLHF, sequence generation is the major bottleneck; increasing vllm_num_engines can reduce the generation time. For 8 GPUs and a 7B model, you can try (ref=1, reward=1, actor=2, critic=2, vllm=2).
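Concretely, under the assumption that each vLLM engine consumes vllm_tensor_parallel_size GPUs, that split maps onto the train_ppo_ray.py flags used throughout this thread as --ref_num_gpus_per_node 1 --reward_num_gpus_per_node 1 --actor_num_gpus_per_node 2 --critic_num_gpus_per_node 2 --vllm_num_engines 2 --vllm_tensor_parallel_size 1, which accounts for all 8 GPUs (1 + 1 + 2 + 2 + 2 = 8).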
@karthik19967829 Our distributed design is for models above 13B, so please set up more GPUs for vllm. For 7B models, you could try OpenRLHF without Ray.
@karthik19967829 @tianhao-nexusflow Fixed in https://github.com/OpenLLMAI/OpenRLHF/pull/215, I have tested vllm==0.2.3 and vllm==0.3.1 with vllm_tensor_parallel_size=1/2, all tests have passed.
@wuxibin89 @hijkzzz Thx! It works now, appreciate the help!!!
@wuxibin89 @hijkzzz thank you so much for the quick help! best open-source team I have met!
vllm==0.4.1 also hangs @wuxibin89
Team, thank you so much for this wonderful toolkit! We are trying to test the vllm setting with the mistralai/Mistral-7B-Instruct-v0.2 model with ZeRO-2:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 1 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 4 \
--pretrain openchat/openchat_3.5 \
--reward_pretrain openchat/openchat_3.5 \
--critic_pretrain openchat/openchat_3.5 \
--save_path /openrlhf/examples/scripts/ckpt/starling_7b \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 16 \
--rollout_batch_size 256 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 2e-7 \
--critic_learning_rate 3e-6 \
--init_kl_coef 0.001 \
--prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf \
--prompt_data_probs 1 \
--max_samples 256 \
--actor_init_on_gpu \
--adam_offload \
--gradient_checkpointing \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1