karthik19967829 opened 4 months ago
For the 7B model, it seems you can try running without vLLM.
Yup, we want to run 34B+ models; we were only testing the vLLM setup with the 7B model, since it works fine without vLLM.
@karthik19967829 I can't reproduce this problem with your script; my job succeeded as expected. Can you post the Ray job supervisor's log? You can find it at /tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log
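For example (assuming the default Ray temp directory; the raysubmit_ ID is printed when the job is submitted):

# list the driver logs, then print the one for your job
ls /tmp/ray/session_latest/logs/job-driver-raysubmit_*.log
cat /tmp/ray/session_latest/logs/job-driver-raysubmit_<JOBID>.log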
Cool, will do that, thanks. I am using 1 node with 8 GPUs; may I know your exact hardware setup and run command?
My hardware is 1 node with 8 A100 GPUs, and the run command is:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "."}' \
--no-wait \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 1 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 4 \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1 \
--pretrain mistralai/Mistral-7B-v0.1 \
--reward_pretrain mistralai/Mistral-7B-v0.1 \
--save_path /openrlhf/examples/scripts/ckpt/starling_7b \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 16 \
--rollout_batch_size 256 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 2e-7 \
--critic_learning_rate 3e-6 \
--init_kl_coef 0.001 \
--prompt_data Open-Orca/OpenOrca \
--prompt_data_probs 1 \
--max_samples 256 \
--actor_init_on_gpu \
--adam_offload \
--gradient_checkpointing
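For the 34B+ runs mentioned above, a plausible (untested here) adjustment would be to give the vLLM engine more of the node via tensor parallelism, e.g.:

# hypothetical change for a 34B+ model on the same 8-GPU node
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 4 \

with the actor/critic/ref/reward GPU counts rebalanced to fit the remaining GPUs.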
Also, could you share the exact versions of the libraries in your environment (e.g., via pip list)?
Thank you so much for the quick response :) Hope we can build something cool together!
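Something like this would capture the relevant versions (the package filter is just a guess at what matters for this setup):

pip list | grep -Ei 'ray|vllm|deepspeed|transformers|torch|flash'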
@wuxibin89 I'm encountering the same problem, and this is in /tmp/ray/session_latest/logs/job-driver-raysubmit_{JOBID}.log:
(openrlhf) root@401e005161f6:/tmp/ray/session_latest/logs# cat job-driver-raysubmit_BqhS5Hp5fuBzecG1.log
[2024-02-17 06:33:00,146] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
2024-02-17 06:33:04,141 INFO worker.py:1405 -- Using address 0.0.0.0:6379 set in the environment variable RAY_ADDRESS
2024-02-17 06:33:04,141 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 0.0.0.0:6379...
2024-02-17 06:33:04,148 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(pid=163692) [2024-02-17 06:33:07,082] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(pid=163692) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
(pid=163692) warnings.warn(
***** constructed actor model: {actor_model}
***** constructed critic model: {critic_model}
***** constructed reference model: {ref_model}
***** constructed reward models: {reward_models}
(ActorModelRayActor pid=163692) [2024-02-17 06:33:14,446] [INFO] [comm.py:637:init_distributed] cdb=None
(ActorModelRayActor pid=163692) [2024-02-17 06:33:14,446] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(pid=163889) [2024-02-17 06:33:11,717] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
***** constructed vLLM engines: {vllm_engines}
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
(pid=163888) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations [repeated 2x across cluster]
(pid=163888) warnings.warn( [repeated 2x across cluster]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:01<00:03, 1.72s/it]
(RewardModelRayActor pid=164150) INFO 02-17 06:33:19 model.py:190] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(RewardModelRayActor pid=164150) INFO 02-17 06:33:19 model.py:190] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(RewardModelRayActor pid=164150) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.52s/it]
(ActorModelRayActor pid=163692) Actor(
(ActorModelRayActor pid=163692) (model): LlamaForCausalLM(
(ActorModelRayActor pid=163692) (model): LlamaModel(
(ActorModelRayActor pid=163692) (embed_tokens): Embedding(32000, 4096)
(ActorModelRayActor pid=163692) (layers): ModuleList(
(ActorModelRayActor pid=163692) (0-31): 32 x LlamaDecoderLayer(
(ActorModelRayActor pid=163692) (self_attn): LlamaFlashAttention2(
(ActorModelRayActor pid=163692) (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (rotary_emb): LlamaRotaryEmbedding()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (mlp): LlamaMLP(
(ActorModelRayActor pid=163692) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(ActorModelRayActor pid=163692) (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(ActorModelRayActor pid=163692) (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(ActorModelRayActor pid=163692) (act_fn): SiLU()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (input_layernorm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) (post_attention_layernorm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (norm): LlamaRMSNorm()
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) )
(ActorModelRayActor pid=163692) dataset: Open-Orca/OpenOrca
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:19,311] [INFO] [comm.py:637:init_distributed] cdb=None [repeated 3x across cluster]
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:19,311] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [repeated 2x across cluster]
(pid=164245) [2024-02-17 06:33:17,109] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [repeated 4x across cluster]
(LLMRayActor pid=164245) INFO 02-17 06:33:19 llm_engine.py:79] Initializing an LLM engine with config: model='OpenLLMAI/Llama-2-7b-sft-model-ocra-500k', tokenizer='OpenLLMAI/Llama-2-7b-sft-model-ocra-500k', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=42)
(LLMRayActor pid=164245) INFO 02-17 06:33:21 weight_utils.py:163] Using model weights format ['*.safetensors']
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s] [repeated 3x across cluster]
(pid=164245) /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations [repeated 4x across cluster]
(pid=164245) warnings.warn( [repeated 4x across cluster]
(ActorModelRayActor pid=163692) dataset: Dahoas/full-hh-rlhf
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.55s/it] [repeated 6x across cluster]
(LLMRayActor pid=164245) INFO 02-17 06:33:23 llm_engine.py:337] # GPU blocks: 7406, # CPU blocks: 512
(ActorModelRayActor pid=163692) dataset: tasksource/oasst1_pairwise_rlhf_reward
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:24,309] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:24,309] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(RewardModelRayActor pid=164150) LLMForSequenceRegression(
(RewardModelRayActor pid=164150) (value_head): Linear(in_features=4096, out_features=1, bias=False)
(RewardModelRayActor pid=164150) reward normalization status: True
(RewardModelRayActor pid=164150) mean: tensor([0.5352], dtype=torch.bfloat16), std tensor([1.8750], dtype=torch.bfloat16)
(LLMRayActor pid=164245) INFO 02-17 06:33:24 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(LLMRayActor pid=164245) INFO 02-17 06:33:24 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,359] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,360] [INFO] [logging.py:96:log_dist] [Rank 0] Creating BF16 optimizer
(ReferenceModelRayActor pid=164147) Actor(
(ReferenceModelRayActor pid=164147) (model): LlamaForCausalLM(
(RewardModelRayActor pid=164150) (model): LlamaModel( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (embed_tokens): Embedding(32000, 4096, padding_idx=2) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (layers): ModuleList( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (0-31): 32 x LlamaDecoderLayer( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (self_attn): LlamaFlashAttention2( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (q_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (k_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (v_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (o_proj): Linear(in_features=4096, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (rotary_emb): LlamaRotaryEmbedding() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) ) [repeated 13x across cluster]
(RewardModelRayActor pid=164150) (mlp): LlamaMLP( [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (up_proj): Linear(in_features=4096, out_features=11008, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (down_proj): Linear(in_features=11008, out_features=4096, bias=False) [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (act_fn): SiLU() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (input_layernorm): LlamaRMSNorm() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (post_attention_layernorm): LlamaRMSNorm() [repeated 2x across cluster]
(RewardModelRayActor pid=164150) (norm): LlamaRMSNorm() [repeated 2x across cluster]
(ReferenceModelRayActor pid=164147) (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,475] [INFO] [utils.py:791:see_memory_usage] begin bf16_optimizer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,476] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,476] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 67.32 GB, percent = 3.3%
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,595] [INFO] [utils.py:791:see_memory_usage] end bf16_optimizer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] activation_checkpointing_config {
(ReferenceModelRayActor pid=164147) "partition_activations": false,
(ReferenceModelRayActor pid=164147) "contiguous_memory_optimization": false,
(ReferenceModelRayActor pid=164147) "cpu_checkpointing": false,
(ReferenceModelRayActor pid=164147) "number_checkpoints": null,
(ReferenceModelRayActor pid=164147) "synchronize_checkpoint_boundary": false,
(ReferenceModelRayActor pid=164147) "profile": false
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] amp_enabled .................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] amp_params ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] autotuning_config ............ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "start_step": null,
(ReferenceModelRayActor pid=164147) "end_step": null,
(ReferenceModelRayActor pid=164147) "metric_path": null,
(ReferenceModelRayActor pid=164147) "arg_mappings": null,
(ReferenceModelRayActor pid=164147) "metric": "throughput",
(ReferenceModelRayActor pid=164147) "model_info": null,
(ReferenceModelRayActor pid=164147) "results_dir": "autotuning_results",
(ReferenceModelRayActor pid=164147) "exps_dir": "autotuning_exps",
(ReferenceModelRayActor pid=164147) "overwrite": true,
(ReferenceModelRayActor pid=164147) "fast": true,
(ReferenceModelRayActor pid=164147) "start_profile_step": 3,
(ReferenceModelRayActor pid=164147) "end_profile_step": 5,
(ReferenceModelRayActor pid=164147) "tuner_type": "gridsearch",
(ReferenceModelRayActor pid=164147) "tuner_early_stopping": 5,
(ReferenceModelRayActor pid=164147) "tuner_num_trials": 50,
(ReferenceModelRayActor pid=164147) "model_info_path": null,
(ReferenceModelRayActor pid=164147) "mp_size": 1,
(ReferenceModelRayActor pid=164147) "max_train_batch_size": null,
(ReferenceModelRayActor pid=164147) "min_train_batch_size": 1,
(ReferenceModelRayActor pid=164147) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(ReferenceModelRayActor pid=164147) "min_train_micro_batch_size_per_gpu": 1,
(ReferenceModelRayActor pid=164147) "num_tuning_micro_batch_sizes": 3
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,597] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa42f7a1810>
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] communication_data_type ...... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] disable_allgather ............ False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dump_state ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] elasticity_enabled ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] flops_profiler_config ........ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "recompute_fwd_factor": 0.0,
(ReferenceModelRayActor pid=164147) "profile_step": 1,
(ReferenceModelRayActor pid=164147) "module_depth": -1,
(ReferenceModelRayActor pid=164147) "top_modules": 1,
(ReferenceModelRayActor pid=164147) "detailed": true,
(ReferenceModelRayActor pid=164147) "output_file": null
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_enabled ................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] global_rank .................. 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] grad_accum_dtype ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_accumulation_steps .. 32
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] graph_harvesting ............. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,598] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] loss_scale ................... 1.0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] memory_breakdown ............. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] mics_shard_size .............. -1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] nebula_config ................ {
(ReferenceModelRayActor pid=164147) "enabled": false,
(ReferenceModelRayActor pid=164147) "persistent_storage_path": null,
(ReferenceModelRayActor pid=164147) "persistent_time_interval": 100,
(ReferenceModelRayActor pid=164147) "num_of_version_in_retention": 2,
(ReferenceModelRayActor pid=164147) "enable_nebula_load": true,
(ReferenceModelRayActor pid=164147) "load_path": null
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_name ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] optimizer_params ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pld_enabled .................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] pld_params ................... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] prescale_gradients ........... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] scheduler_name ............... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] scheduler_params ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] sparse_attention ............. None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] steps_per_print .............. 100
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] train_batch_size ............. 128
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] use_node_local_storage ....... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] weight_quantization_config ... None
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] world_size ................... 1
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_enabled ................. False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,599] [INFO] [config.py:988:print] zero_optimization_stage ...... 0
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,600] [INFO] [config.py:974:print_user_config] json = {
(ReferenceModelRayActor pid=164147) "steps_per_print": 100,
(ReferenceModelRayActor pid=164147) "zero_optimization": {
(ReferenceModelRayActor pid=164147) "stage": 0,
(ReferenceModelRayActor pid=164147) "stage3_param_persistence_threshold": "auto",
(ReferenceModelRayActor pid=164147) "offload_param": {
(ReferenceModelRayActor pid=164147) "device": "none",
(ReferenceModelRayActor pid=164147) "pin_memory": true
(ReferenceModelRayActor pid=164147) }
(ReferenceModelRayActor pid=164147) },
(ReferenceModelRayActor pid=164147) "bf16": {
(ReferenceModelRayActor pid=164147) "enabled": true
(ReferenceModelRayActor pid=164147) },
(ReferenceModelRayActor pid=164147) "gradient_clipping": 1.0,
(ReferenceModelRayActor pid=164147) "prescale_gradients": false,
(ReferenceModelRayActor pid=164147) "wall_clock_breakdown": false,
(ReferenceModelRayActor pid=164147) "train_micro_batch_size_per_gpu": 4,
(ReferenceModelRayActor pid=164147) "train_batch_size": 128
(ReferenceModelRayActor pid=164147) }
(ActorModelRayActor pid=163692) [Dataset({
(ActorModelRayActor pid=163692) features: ['id', 'system_prompt', 'question', 'response'],
(ActorModelRayActor pid=163692) num_rows: 80000
(ActorModelRayActor pid=163692) }), Dataset({
(ActorModelRayActor pid=163692) features: ['prompt', 'response', 'chosen', 'rejected'],
(ActorModelRayActor pid=163692) num_rows: 80000
(ActorModelRayActor pid=163692) }), Dataset({
(ActorModelRayActor pid=163692) features: ['lang', 'parent_id', 'prompt', 'chosen', 'rejected'],
(ActorModelRayActor pid=163692) num_rows: 17966
(ActorModelRayActor pid=163692) })]
(LLMRayActor pid=164245) INFO 02-17 06:33:30 model_runner.py:738] Graph capturing finished in 5 secs.
(RewardModelRayActor pid=164150) [2024-02-17 06:33:24,423] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(RewardModelRayActor pid=164150) [2024-02-17 06:33:24,424] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
0%| | 0/80000 [00:00<?, ?it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.47s/it] [repeated 3x across cluster]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.62s/it]
100%|██████████| 80000/80000 [00:08<00:00, 9377.14it/s]
(ActorModelRayActor pid=163888) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,355] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(ReferenceModelRayActor pid=164147) [2024-02-17 06:33:26,357] [INFO] [logging.py:96:log_dist] [Rank 0] Creating BF16 optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,519] [INFO] [utils.py:791:see_memory_usage] begin bf16_optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,683] [INFO] [utils.py:792:see_memory_usage] MA 12.37 GB Max_MA 12.37 GB CA 12.37 GB Max_CA 12 GB [repeated 3x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,684] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 67.33 GB, percent = 3.3% [repeated 3x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,682] [INFO] [utils.py:791:see_memory_usage] end bf16_optimizer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,684] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] activation_checkpointing_config {
(RewardModelRayActor pid=164150) "partition_activations": false,
(RewardModelRayActor pid=164150) "contiguous_memory_optimization": false,
(RewardModelRayActor pid=164150) "cpu_checkpointing": false,
(RewardModelRayActor pid=164150) "number_checkpoints": null,
(RewardModelRayActor pid=164150) "synchronize_checkpoint_boundary": false,
(RewardModelRayActor pid=164150) "profile": false
(RewardModelRayActor pid=164150) } [repeated 6x across cluster]
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] amp_enabled .................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] amp_params ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] autotuning_config ............ {
(RewardModelRayActor pid=164150) "enabled": false, [repeated 3x across cluster]
(RewardModelRayActor pid=164150) "start_step": null,
(RewardModelRayActor pid=164150) "end_step": null,
(RewardModelRayActor pid=164150) "metric_path": null,
(RewardModelRayActor pid=164150) "arg_mappings": null,
(RewardModelRayActor pid=164150) "metric": "throughput",
(RewardModelRayActor pid=164150) "model_info": null,
(RewardModelRayActor pid=164150) "results_dir": "autotuning_results",
(RewardModelRayActor pid=164150) "exps_dir": "autotuning_exps",
(RewardModelRayActor pid=164150) "overwrite": true,
(RewardModelRayActor pid=164150) "fast": true,
(RewardModelRayActor pid=164150) "start_profile_step": 3,
(RewardModelRayActor pid=164150) "end_profile_step": 5,
(RewardModelRayActor pid=164150) "tuner_type": "gridsearch",
(RewardModelRayActor pid=164150) "tuner_early_stopping": 5,
(RewardModelRayActor pid=164150) "tuner_num_trials": 50,
(RewardModelRayActor pid=164150) "model_info_path": null,
(RewardModelRayActor pid=164150) "mp_size": 1,
(RewardModelRayActor pid=164150) "max_train_batch_size": null,
(RewardModelRayActor pid=164150) "min_train_batch_size": 1,
(RewardModelRayActor pid=164150) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(RewardModelRayActor pid=164150) "min_train_micro_batch_size_per_gpu": 1,
(RewardModelRayActor pid=164150) "num_tuning_micro_batch_sizes": 3
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f1634a72cb0>
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] communication_data_type ...... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,685] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] disable_allgather ............ False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dump_state ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] elasticity_enabled ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] flops_profiler_config ........ {
(RewardModelRayActor pid=164150) "recompute_fwd_factor": 0.0,
(RewardModelRayActor pid=164150) "profile_step": 1,
(RewardModelRayActor pid=164150) "module_depth": -1,
(RewardModelRayActor pid=164150) "top_modules": 1,
(RewardModelRayActor pid=164150) "detailed": true,
(RewardModelRayActor pid=164150) "output_file": null
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_enabled ................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] global_rank .................. 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] grad_accum_dtype ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_accumulation_steps .. 32
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] graph_harvesting ............. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] loss_scale ................... 1.0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] memory_breakdown ............. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,686] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] mics_shard_size .............. -1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] nebula_config ................ {
(RewardModelRayActor pid=164150) "persistent_storage_path": null,
(RewardModelRayActor pid=164150) "persistent_time_interval": 100,
(RewardModelRayActor pid=164150) "num_of_version_in_retention": 2,
(RewardModelRayActor pid=164150) "enable_nebula_load": true,
(RewardModelRayActor pid=164150) "load_path": null
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_name ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] optimizer_params ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pld_enabled .................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] pld_params ................... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] prescale_gradients ........... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] scheduler_name ............... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] scheduler_params ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] sparse_attention ............. None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] steps_per_print .............. 100
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] train_batch_size ............. 128
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] use_node_local_storage ....... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] weight_quantization_config ... None
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] world_size ................... 1
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_enabled ................. False
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:988:print] zero_optimization_stage ...... 0
(RewardModelRayActor pid=164150) [2024-02-17 06:33:26,687] [INFO] [config.py:974:print_user_config] json = {
(RewardModelRayActor pid=164150) "steps_per_print": 100,
(RewardModelRayActor pid=164150) "zero_optimization": {
(RewardModelRayActor pid=164150) "stage": 0,
(RewardModelRayActor pid=164150) "stage3_param_persistence_threshold": "auto",
(RewardModelRayActor pid=164150) "offload_param": {
(RewardModelRayActor pid=164150) "device": "none",
(RewardModelRayActor pid=164150) "pin_memory": true
(RewardModelRayActor pid=164150) }, [repeated 2x across cluster]
(RewardModelRayActor pid=164150) "bf16": {
(RewardModelRayActor pid=164150) "enabled": true
(RewardModelRayActor pid=164150) "gradient_clipping": 1.0,
(RewardModelRayActor pid=164150) "prescale_gradients": false,
(RewardModelRayActor pid=164150) "wall_clock_breakdown": false,
(RewardModelRayActor pid=164150) "train_micro_batch_size_per_gpu": 4,
(RewardModelRayActor pid=164150) "train_batch_size": 128
(ActorModelRayActor pid=163888) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(ActorModelRayActor pid=163888) Detected CUDA files, patching ldflags
(ActorModelRayActor pid=163888) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(ActorModelRayActor pid=163888) Building extension module cpu_adam...
(ActorModelRayActor pid=163888) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(ActorModelRayActor pid=163888) ninja: no work to do.
(ActorModelRayActor pid=163888) Time to load cpu_adam op: 2.4911670684814453 seconds
(ActorModelRayActor pid=163888) Loading extension module cpu_adam...
(ActorModelRayActor pid=163692) [2024-02-17 06:33:47,732] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(ActorModelRayActor pid=163692) [2024-02-17 06:33:47,732] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,346] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,347] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,347] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,360] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 500,000,000
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 500,000,000
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: True
(ActorModelRayActor pid=163692) [2024-02-17 06:33:52,361] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
(ActorModelRayActor pid=163692) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(ActorModelRayActor pid=163692) ninja: no work to do.
(ActorModelRayActor pid=163692) Time to load cpu_adam op: 2.5111589431762695 seconds
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,969] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,970] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:11,970] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 73.06 GB, percent = 3.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,788] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 153.58 GB, percent = 7.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,789] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,916] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,916] [INFO] [utils.py:792:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 12.86 GB Max_CA 13 GB
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,917] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 153.59 GB, percent = 7.6%
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f9759c103d0>
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,922] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,923] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,923] [INFO] [config.py:988:print] activation_checkpointing_config {
(ActorModelRayActor pid=163692) "partition_activations": false,
(ActorModelRayActor pid=163692) "contiguous_memory_optimization": false,
(ActorModelRayActor pid=163692) "cpu_checkpointing": false,
(ActorModelRayActor pid=163692) "number_checkpoints": null,
(ActorModelRayActor pid=163692) "synchronize_checkpoint_boundary": false,
(ActorModelRayActor pid=163692) "profile": false
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] amp_enabled .................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] amp_params ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] autotuning_config ............ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "start_step": null,
(ActorModelRayActor pid=163692) "end_step": null,
(ActorModelRayActor pid=163692) "metric_path": null,
(ActorModelRayActor pid=163692) "arg_mappings": null,
(ActorModelRayActor pid=163692) "metric": "throughput",
(ActorModelRayActor pid=163692) "model_info": null,
(ActorModelRayActor pid=163692) "results_dir": "autotuning_results",
(ActorModelRayActor pid=163692) "exps_dir": "autotuning_exps",
(ActorModelRayActor pid=163692) "overwrite": true,
(ActorModelRayActor pid=163692) "fast": true,
(ActorModelRayActor pid=163692) "start_profile_step": 3,
(ActorModelRayActor pid=163692) "end_profile_step": 5,
(ActorModelRayActor pid=163692) "tuner_type": "gridsearch",
(ActorModelRayActor pid=163692) "tuner_early_stopping": 5,
(ActorModelRayActor pid=163692) "tuner_num_trials": 50,
(ActorModelRayActor pid=163692) "model_info_path": null,
(ActorModelRayActor pid=163692) "mp_size": 1,
(ActorModelRayActor pid=163692) "max_train_batch_size": null,
(ActorModelRayActor pid=163692) "min_train_batch_size": 1,
(ActorModelRayActor pid=163692) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(ActorModelRayActor pid=163692) "min_train_micro_batch_size_per_gpu": 1,
(ActorModelRayActor pid=163692) "num_tuning_micro_batch_sizes": 3
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f97301aa3b0>
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] communication_data_type ...... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] disable_allgather ............ False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dump_state ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,924] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] elasticity_enabled ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] flops_profiler_config ........ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "recompute_fwd_factor": 0.0,
(ActorModelRayActor pid=163692) "profile_step": 1,
(ActorModelRayActor pid=163692) "module_depth": -1,
(ActorModelRayActor pid=163692) "top_modules": 1,
(ActorModelRayActor pid=163692) "detailed": true,
(ActorModelRayActor pid=163692) "output_file": null
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_enabled ................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] global_rank .................. 0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] grad_accum_dtype ............. fp32
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_accumulation_steps .. 16
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] graph_harvesting ............. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] loss_scale ................... 1.0
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] memory_breakdown ............. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] mics_shard_size .............. -1
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] nebula_config ................ {
(ActorModelRayActor pid=163692) "enabled": false,
(ActorModelRayActor pid=163692) "persistent_storage_path": null,
(ActorModelRayActor pid=163692) "persistent_time_interval": 100,
(ActorModelRayActor pid=163692) "num_of_version_in_retention": 2,
(ActorModelRayActor pid=163692) "enable_nebula_load": true,
(ActorModelRayActor pid=163692) "load_path": null
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_name ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] optimizer_params ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pld_enabled .................. False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] pld_params ................... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] prescale_gradients ........... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,925] [INFO] [config.py:988:print] scheduler_name ............... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] scheduler_params ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] sparse_attention ............. None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] steps_per_print .............. 100
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] train_batch_size ............. 128
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] use_node_local_storage ....... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] weight_quantization_config ... None
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] world_size ................... 2
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_enabled ................. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
(ActorModelRayActor pid=163692) [2024-02-17 06:34:31,926] [INFO] [config.py:974:print_user_config] json = {
(ActorModelRayActor pid=163692) "steps_per_print": 100,
(ActorModelRayActor pid=163692) "zero_optimization": {
(ActorModelRayActor pid=163692) "stage": 2,
(ActorModelRayActor pid=163692) "offload_param": {
(ActorModelRayActor pid=163692) "device": "none"
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "offload_optimizer": {
(ActorModelRayActor pid=163692) "device": "cpu",
(ActorModelRayActor pid=163692) "pin_memory": true
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "sub_group_size": "auto",
(ActorModelRayActor pid=163692) "stage3_max_live_parameters": "auto",
(ActorModelRayActor pid=163692) "stage3_max_reuse_distance": "auto",
(ActorModelRayActor pid=163692) "stage3_param_persistence_threshold": "auto",
(ActorModelRayActor pid=163692) "stage3_prefetch_bucket_size": "auto",
(ActorModelRayActor pid=163692) "reduce_bucket_size": "auto",
(ActorModelRayActor pid=163692) "zero_hpz_partition_size": 1,
(ActorModelRayActor pid=163692) "zero_quantized_weights": false,
(ActorModelRayActor pid=163692) "zero_quantized_gradients": false
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "bf16": {
(ActorModelRayActor pid=163692) "enabled": true
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "gradient_clipping": 1.0,
(ActorModelRayActor pid=163692) "prescale_gradients": false,
(ActorModelRayActor pid=163692) "wall_clock_breakdown": false,
(ActorModelRayActor pid=163692) "data_types": {
(ActorModelRayActor pid=163692) "grad_accum_dtype": "fp32"
(ActorModelRayActor pid=163692) },
(ActorModelRayActor pid=163692) "train_micro_batch_size_per_gpu": 4,
(ActorModelRayActor pid=163692) "train_batch_size": 128
(ActorModelRayActor pid=163692) }
(ActorModelRayActor pid=163692) ***** ppo_actor 207 actor model prepared
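Note: the user config JSON printed above corresponds to an ordinary DeepSpeed config dict. A minimal sketch of the equivalent setup, assuming a direct deepspeed.initialize call rather than OpenRLHF's internal wrapper (model and optimizer here are placeholders, not the repo's actual objects):

import deepspeed

ds_config = {
    "steps_per_print": 100,
    "zero_optimization": {
        "stage": 2,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "data_types": {"grad_accum_dtype": "fp32"},
    "train_micro_batch_size_per_gpu": 4,
    # 128 = 4 micro-batch x 16 gradient-accumulation steps x 2 data-parallel ranks,
    # matching gradient_accumulation_steps=16 and world_size=2 in the log above
    "train_batch_size": 128,
}

# engine, optimizer, _, scheduler = deepspeed.initialize(
#     model=model, optimizer=optimizer, config=ds_config
# )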
(CriticModelRayActor pid=163889) [2024-02-17 06:34:31,930] [INFO] [comm.py:637:init_distributed] cdb=None
(CriticModelRayActor pid=163889) [2024-02-17 06:34:31,930] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
(CriticModelRayActor pid=163889) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(ActorModelRayActor pid=163692) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(ActorModelRayActor pid=163692) Detected CUDA files, patching ldflags
(ActorModelRayActor pid=163692) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(ActorModelRayActor pid=163692) Building extension module cpu_adam...
(ActorModelRayActor pid=163692) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(ActorModelRayActor pid=163692) Loading extension module cpu_adam...
(CriticModelRayActor pid=163889) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
(CriticModelRayActor pid=163889) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 33%|███▎ | 1/3 [00:01<00:03, 1.58s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.42s/it]
(CriticModelRayActor pid=164146) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(CriticModelRayActor pid=163889) LLMForSequenceRegression(
(CriticModelRayActor pid=163889) (model): LlamaModel(
(CriticModelRayActor pid=163889) (embed_tokens): Embedding(32000, 4096, padding_idx=2)
(CriticModelRayActor pid=163889) (layers): ModuleList(
(CriticModelRayActor pid=163889) (0-31): 32 x LlamaDecoderLayer(
(CriticModelRayActor pid=163889) (self_attn): LlamaFlashAttention2(
(CriticModelRayActor pid=163889) (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (rotary_emb): LlamaRotaryEmbedding()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (mlp): LlamaMLP(
(CriticModelRayActor pid=163889) (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(CriticModelRayActor pid=163889) (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(CriticModelRayActor pid=163889) (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(CriticModelRayActor pid=163889) (act_fn): SiLU()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (input_layernorm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) (post_attention_layernorm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (norm): LlamaRMSNorm()
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) (value_head): Linear(in_features=4096, out_features=1, bias=False)
(CriticModelRayActor pid=163889) )
(CriticModelRayActor pid=163889) reward normalization status: True
(CriticModelRayActor pid=163889) mean: tensor([0.5352], dtype=torch.bfloat16), std tensor([1.8750], dtype=torch.bfloat16)
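Note: with reward normalization enabled, raw reward-model outputs are standardized using the running statistics printed above. A one-line sketch of the usual transform (variable names are illustrative, not OpenRLHF's exact code):

# mean ~ 0.54 and std ~ 1.88 are the values from the log above
normalized_reward = (raw_reward - mean) / (std + 1e-8)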
(ActorModelRayActor pid=163888) ***** ppo_actor 207 actor model prepared
(CriticModelRayActor pid=164146) [2024-02-17 06:34:31,931] [INFO] [comm.py:637:init_distributed] cdb=None
(CriticModelRayActor pid=164146) INFO 02-17 06:34:32 model.py:248] Monkey patch for Flash Attention, see https://github.com/huggingface/transformers/issues/28052 [repeated 2x across cluster]
(CriticModelRayActor pid=164146) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(CriticModelRayActor pid=164146) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Loading checkpoint shards: 67%|██████▋ | 2/3 [00:03<00:01, 1.69s/it] [repeated 3x across cluster]
(CriticModelRayActor pid=164146) ninja: no work to do.
(CriticModelRayActor pid=164146) Time to load cpu_adam op: 2.4911234378814697 seconds
(CriticModelRayActor pid=164146) Detected CUDA files, patching ldflags
(CriticModelRayActor pid=164146) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(CriticModelRayActor pid=164146) Building extension module cpu_adam...
(CriticModelRayActor pid=164146) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(CriticModelRayActor pid=164146) Loading extension module cpu_adam...
(CriticModelRayActor pid=163889) [2024-02-17 06:34:41,505] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.1, git-hash=unknown, git-branch=unknown
(CriticModelRayActor pid=163889) [2024-02-17 06:34:41,506] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,857] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,858] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,858] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,871] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,871] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:143:__init__] Reduce bucket size 500,000,000
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:144:__init__] Allgather bucket size 500,000,000
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:145:__init__] CPU Offload: True
(CriticModelRayActor pid=163889) [2024-02-17 06:34:44,872] [INFO] [stage_1_and_2.py:146:__init__] Round robin gradient partitioning: False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,983] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,984] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:02,984] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 183.9 GB, percent = 9.1%
(CriticModelRayActor pid=163889) Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
(CriticModelRayActor pid=163889) ninja: no work to do.
(CriticModelRayActor pid=163889) Time to load cpu_adam op: 2.4824106693267822 seconds
(CriticModelRayActor pid=164146) ***** Critic model is ready
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,812] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 263.63 GB, percent = 13.1%
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,813] [INFO] [stage_1_and_2.py:533:__init__] optimizer state initialized
***** async init model done
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,939] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,940] [INFO] [utils.py:792:see_memory_usage] MA 12.61 GB Max_MA 12.61 GB CA 12.62 GB Max_CA 13 GB
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,940] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 263.63 GB, percent = 13.1%
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,945] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,945] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f873f0fd0c0>
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)]
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,946] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] activation_checkpointing_config {
(CriticModelRayActor pid=163889) "partition_activations": false,
(CriticModelRayActor pid=163889) "contiguous_memory_optimization": false,
(CriticModelRayActor pid=163889) "cpu_checkpointing": false,
(CriticModelRayActor pid=163889) "number_checkpoints": null,
(CriticModelRayActor pid=163889) "synchronize_checkpoint_boundary": false,
(CriticModelRayActor pid=163889) "profile": false
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] amp_enabled .................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] amp_params ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] autotuning_config ............ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "start_step": null,
(CriticModelRayActor pid=163889) "end_step": null,
(CriticModelRayActor pid=163889) "metric_path": null,
(CriticModelRayActor pid=163889) "arg_mappings": null,
(CriticModelRayActor pid=163889) "metric": "throughput",
(CriticModelRayActor pid=163889) "model_info": null,
(CriticModelRayActor pid=163889) "results_dir": "autotuning_results",
(CriticModelRayActor pid=163889) "exps_dir": "autotuning_exps",
(CriticModelRayActor pid=163889) "overwrite": true,
(CriticModelRayActor pid=163889) "fast": true,
(CriticModelRayActor pid=163889) "start_profile_step": 3,
(CriticModelRayActor pid=163889) "end_profile_step": 5,
(CriticModelRayActor pid=163889) "tuner_type": "gridsearch",
(CriticModelRayActor pid=163889) "tuner_early_stopping": 5,
(CriticModelRayActor pid=163889) "tuner_num_trials": 50,
(CriticModelRayActor pid=163889) "model_info_path": null,
(CriticModelRayActor pid=163889) "mp_size": 1,
(CriticModelRayActor pid=163889) "max_train_batch_size": null,
(CriticModelRayActor pid=163889) "min_train_batch_size": 1,
(CriticModelRayActor pid=163889) "max_train_micro_batch_size_per_gpu": 1.024000e+03,
(CriticModelRayActor pid=163889) "min_train_micro_batch_size_per_gpu": 1,
(CriticModelRayActor pid=163889) "num_tuning_micro_batch_sizes": 3
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] bfloat16_enabled ............. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f873c56dd20>
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] communication_data_type ...... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] dataloader_drop_last ......... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] disable_allgather ............ False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,947] [INFO] [config.py:988:print] dump_state ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] elasticity_enabled ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] flops_profiler_config ........ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "recompute_fwd_factor": 0.0,
(CriticModelRayActor pid=163889) "profile_step": 1,
(CriticModelRayActor pid=163889) "module_depth": -1,
(CriticModelRayActor pid=163889) "top_modules": 1,
(CriticModelRayActor pid=163889) "detailed": true,
(CriticModelRayActor pid=163889) "output_file": null
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_auto_cast ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_enabled ................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] global_rank .................. 0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] grad_accum_dtype ............. fp32
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_accumulation_steps .. 16
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] graph_harvesting ............. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] load_universal_checkpoint .... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] loss_scale ................... 1.0
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] memory_breakdown ............. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] mics_shard_size .............. -1
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] nebula_config ................ {
(CriticModelRayActor pid=163889) "enabled": false,
(CriticModelRayActor pid=163889) "persistent_storage_path": null,
(CriticModelRayActor pid=163889) "persistent_time_interval": 100,
(CriticModelRayActor pid=163889) "num_of_version_in_retention": 2,
(CriticModelRayActor pid=163889) "enable_nebula_load": true,
(CriticModelRayActor pid=163889) "load_path": null
(CriticModelRayActor pid=163889) }
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,948] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] optimizer_name ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] optimizer_params ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pld_enabled .................. False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] pld_params ................... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] prescale_gradients ........... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] scheduler_name ............... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] scheduler_params ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] sparse_attention ............. None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] steps_per_print .............. 100
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] train_batch_size ............. 128
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 4
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] use_node_local_storage ....... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] weight_quantization_config ... None
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] world_size ................... 2
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_allow_untested_optimizer False
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_enabled ................. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:988:print] zero_optimization_stage ...... 2
(CriticModelRayActor pid=163889) [2024-02-17 06:35:20,949] [INFO] [config.py:974:print_user_config] json = {
(CriticModelRayActor pid=163889) "steps_per_print": 100,
(CriticModelRayActor pid=163889) "zero_optimization": {
(CriticModelRayActor pid=163889) "stage": 2,
(CriticModelRayActor pid=163889) "offload_param": {
(CriticModelRayActor pid=163889) "device": "none"
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "offload_optimizer": {
(CriticModelRayActor pid=163889) "device": "cpu",
(CriticModelRayActor pid=163889) "pin_memory": true
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "sub_group_size": "auto",
(CriticModelRayActor pid=163889) "stage3_max_live_parameters": "auto",
(CriticModelRayActor pid=163889) "stage3_max_reuse_distance": "auto",
(CriticModelRayActor pid=163889) "stage3_param_persistence_threshold": "auto",
(CriticModelRayActor pid=163889) "stage3_prefetch_bucket_size": "auto",
(CriticModelRayActor pid=163889) "reduce_bucket_size": "auto",
(CriticModelRayActor pid=163889) "zero_hpz_partition_size": 1,
(CriticModelRayActor pid=163889) "zero_quantized_weights": false,
(CriticModelRayActor pid=163889) "zero_quantized_gradients": false
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "bf16": {
(CriticModelRayActor pid=163889) "enabled": true
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "gradient_clipping": 1.0,
(CriticModelRayActor pid=163889) "prescale_gradients": false,
(CriticModelRayActor pid=163889) "wall_clock_breakdown": false,
(CriticModelRayActor pid=163889) "data_types": {
(CriticModelRayActor pid=163889) "grad_accum_dtype": "fp32"
(CriticModelRayActor pid=163889) },
(CriticModelRayActor pid=163889) "train_micro_batch_size_per_gpu": 4,
(CriticModelRayActor pid=163889) "train_batch_size": 128
(CriticModelRayActor pid=163889) }
(ActorModelRayActor pid=163692) wandb: Currently logged in as: tianhaowu. Use `wandb login --relogin` to force relogin
(ActorModelRayActor pid=163692) wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Loading checkpoint shards: 100%|██████████| 3/3 [00:04<00:00, 1.52s/it]
(CriticModelRayActor pid=163889) Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
(CriticModelRayActor pid=163889) Detected CUDA files, patching ldflags
(CriticModelRayActor pid=163889) Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
(CriticModelRayActor pid=163889) Building extension module cpu_adam...
(CriticModelRayActor pid=163889) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(CriticModelRayActor pid=163889) Loading extension module cpu_adam...
(ActorModelRayActor pid=163692) wandb: Tracking run with wandb version 0.16.3
(ActorModelRayActor pid=163692) wandb: Run data is saved locally in /tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/wandb/run-20240217_063522-3w87zqyo
(ActorModelRayActor pid=163692) wandb: Run `wandb offline` to turn off syncing.
(ActorModelRayActor pid=163692) wandb: Syncing run ppo_0217T06:33
(ActorModelRayActor pid=163692) wandb: ⭐️ View project at https://wandb.ai/tianhaowu/openrlhf_train_ppo
(ActorModelRayActor pid=163692) wandb: 🚀 View run at https://wandb.ai/tianhaowu/openrlhf_train_ppo/runs/3w87zqyo
(ActorModelRayActor pid=163888) Adam Optimizer #0 is created with AVX2 arithmetic capability.
(ActorModelRayActor pid=163888) Config: alpha=0.000000, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1
(CriticModelRayActor pid=163889) ***** Critic model is ready
(ActorModelRayActor pid=163888) [rank1]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/examples/train_ppo_ray.py", line 291, in <module>
train(args)
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/examples/train_ppo_ray.py", line 162, in train
ray.get(refs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_private/worker.py", line 2624, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::ActorModelRayActor.fit() (pid=163888, ip=0.0.0.0, actor_id=00f790e87ebcba2952fe737b02000000, repr=<openrlhf.trainer.ray.ppo_actor.ActorModelRayActor object at 0x7f5f703cd390>)
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/openrlhf/trainer/ray/ppo_actor.py", line 282, in fit
trainer = ActorPPOTrainer(
File "/tmp/ray/session_2024-02-17_06-32-38_204743_132895/runtime_resources/working_dir_files/_ray_pkg_d967d6aad00c4088/openrlhf/trainer/ray/ppo_actor.py", line 96, in __init__
torch.distributed.barrier()
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:550 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f63740f4d87 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15c0e0b (0x7f604e191e0b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7f6052460b32 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7f6052461961 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f6052416dd1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7f601b654c69 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x22b (0x7f601b65bc5b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: <unknown function> + 0x10ad03d (0x7f601b66503d in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f601b6668e1 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3bf (0x7f601b6688ff in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb0e (0x7f601b677d4e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x5838872 (0x7f6052409872 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x5843590 (0x7f6052414590 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x5843695 (0x7f6052414695 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x4e8937c (0x7f6051a5a37c in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x1a08a38 (0x7f604e5d9a38 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x584cca4 (0x7f605241dca4 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x584da55 (0x7f605241ea55 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0xc93e88 (0x7f62b6247e88 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #21: <unknown function> + 0x413ef4 (0x7f62b59c7ef4 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0x172df4 (0x556163100df4 in ray::ActorModelRayActor.fit)
frame #23: _PyObject_MakeTpCall + 0x1f8 (0x5561630c7db8 in ray::ActorModelRayActor.fit)
frame #24: <unknown function> + 0xeb5a7 (0x5561630795a7 in ray::ActorModelRayActor.fit)
frame #25: <unknown function> + 0x105bbf (0x556163093bbf in ray::ActorModelRayActor.fit)
frame #26: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #27: _PyObject_Call + 0x1f6 (0x5561630ce3f6 in ray::ActorModelRayActor.fit)
frame #28: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #29: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #30: <unknown function> + 0x10669e (0x55616309469e in ray::ActorModelRayActor.fit)
frame #31: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #32: _PyObject_FastCallDictTstate + 0x162 (0x556163115a92 in ray::ActorModelRayActor.fit)
frame #33: <unknown function> + 0x191f53 (0x55616311ff53 in ray::ActorModelRayActor.fit)
frame #34: <unknown function> + 0x153a21 (0x5561630e1a21 in ray::ActorModelRayActor.fit)
frame #35: _PyObject_Call + 0x259 (0x5561630ce459 in ray::ActorModelRayActor.fit)
frame #36: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #37: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #38: _PyObject_Call + 0xf7 (0x5561630ce2f7 in ray::ActorModelRayActor.fit)
frame #39: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #40: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #41: _PyObject_Call + 0xf7 (0x5561630ce2f7 in ray::ActorModelRayActor.fit)
frame #42: _PyEval_EvalFrameDefault + 0x2216 (0x556163168c16 in ray::ActorModelRayActor.fit)
frame #43: <unknown function> + 0x1871eb (0x5561631151eb in ray::ActorModelRayActor.fit)
frame #44: PyVectorcall_Call + 0x9c (0x556163023c4c in ray::ActorModelRayActor.fit)
frame #45: <unknown function> + 0x5ade2f (0x7f6377ac5e2f in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #46: <unknown function> + 0x5ef9b8 (0x7f6377b079b8 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #47: <unknown function> + 0x5ade2f (0x7f6377ac5e2f in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #48: <unknown function> + 0x670b3e (0x7f6377b88b3e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #49: std::_Function_handler<ray::Status (ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool, long), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, std::string*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool, bool, bool, long)>::_M_invoke(std::_Any_data const&, ray::rpc::Address const&, ray::rpc::TaskType&&, std::string&&, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*&&, std::string*&&, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string&&, bool&&, bool&&, bool&&, long&&) + 0x169 (0x7f6377acb509 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #50: ray::core::CoreWorker::ExecuteTask(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*) + 0xc5c (0x7f6377ca918c in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #51: std::_Function_handler<ray::Status (ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*), std::_Bind<ray::Status (ray::core::CoreWorker::*(ray::core::CoreWorker*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>, std::_Placeholder<4>, std::_Placeholder<5>, std::_Placeholder<6>, std::_Placeholder<7>, std::_Placeholder<8>))(ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > > const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*, bool*, std::string*)> >::_M_invoke(std::_Any_data const&, ray::TaskSpecification const&, std::shared_ptr<std::unordered_map<std::string, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > >, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::vector<std::pair<long, double>, std::allocator<std::pair<long, double> > > > > > >&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*&&, std::vector<std::pair<ray::ObjectID, bool>, std::allocator<std::pair<ray::ObjectID, bool> > >*&&, google::protobuf::RepeatedPtrField<ray::rpc::ObjectReferenceCount>*&&, bool*&&, std::string*&&) + 0x58 (0x7f6377be0f98 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #52: <unknown function> + 0x7b7664 (0x7f6377ccf664 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #53: <unknown function> + 0x7b889a (0x7f6377cd089a in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #54: <unknown function> + 0x7cfe1e (0x7f6377ce7e1e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #55: ray::core::ActorSchedulingQueue::AcceptRequestOrRejectIfCanceled(ray::TaskID, ray::core::InboundRequest&) + 0x114 (0x7f6377ce8e34 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #56: <unknown function> + 0x7d3a5b (0x7f6377ceba5b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #57: ray::core::ActorSchedulingQueue::Add(long, long, std::function<void (std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status const&, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>)>, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>, std::string const&, std::shared_ptr<ray::FunctionDescriptorInterface> const&, ray::TaskID, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&) + 0x400 (0x7f6377ced570 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #58: ray::core::CoreWorkerDirectTaskReceiver::HandleTask(ray::rpc::PushTaskRequest const&, ray::rpc::PushTaskReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) + 0x119c (0x7f6377ccefcc in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #59: <unknown function> + 0x75b6f5 (0x7f6377c736f5 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #60: <unknown function> + 0xa2864e (0x7f6377f4064e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #61: <unknown function> + 0xa21a3e (0x7f6377f39a3e in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #62: <unknown function> + 0xa21eb6 (0x7f6377f39eb6 in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
frame #63: <unknown function> + 0x10d550b (0x7f63785ed50b in /root/miniconda3/envs/openrlhf/lib/python3.10/site-packages/ray/_raylet.so)
. This may indicate a possible application crash on rank 0 or a network set up issue.
also could you share the exact version of libraries by using
pip list
in your environment? Thank you so much for the quick response :) hope we can build something cool together
Package Version
-------------------------- ------------
accelerate 0.27.2
aiohttp 3.9.1
aiohttp-cors 0.7.0
aioprometheus 23.3.0
aiorwlock 1.3.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 3.7.1
appdirs 1.4.4
async-timeout 4.0.3
attrs 23.1.0
bitsandbytes 0.42.0
blessed 1.20.0
boltons 23.0.0
brotlipy 0.7.0
bytedance-context 0.7.1
bytedance.metrics 0.4.0
bytedance.servicediscovery 0.1.2
bytedbackgrounds 0.0.6
bytedenv 0.6.2
bytedray 2.6.1
bytedservicediscovery 0.17.4
cachetools 5.3.2
certifi 2023.5.7
cffi 1.15.1
charset-normalizer 2.0.4
click 8.1.7
coloredlogs 15.0.1
colorful 0.5.5
conda 23.5.2
conda-content-trust 0.1.3
conda-libmamba-solver 23.5.0
conda-package-handling 2.1.0
conda_package_streaming 0.8.0
crypto 1.4.1
cryptography 39.0.1
cupy-cuda11x 12.3.0
datasets 2.15.0
deepspeed 0.12.5
dill 0.3.7
distlib 0.3.8
docker-pycreds 0.4.0
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.98.0
fastrlock 0.8.2
filelock 3.9.0
flash-attn 2.3.6
frozenlist 1.3.3
fsspec 2023.10.0
gitdb 4.0.11
GitPython 3.1.40
google-api-core 2.15.0
google-auth 2.25.2
googleapis-common-protos 1.62.0
gpustat 1.0.0
grpcio 1.59.3
h11 0.14.0
hjson 3.1.0
httptools 0.6.1
huggingface-hub 0.20.1
humanfriendly 10.0
idna 3.4
ipaddress 1.0.23
isort 5.13.2
Jinja2 3.1.2
jsonlines 4.0.0
jsonpatch 1.32
jsonpointer 2.1
jsonschema 4.17.3
jsonschema-specifications 2023.11.2
libmambapy 1.4.1
lightning-utilities 0.10.1
loralib 0.1.2
MarkupSafe 2.1.3
mpmath 1.3.0
msgpack 1.0.7
multidict 6.0.4
multiprocess 0.70.15
Naked 0.1.32
networkx 3.0
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-ml-py 11.495.46
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
opencensus 0.11.3
opencensus-context 0.1.3
optimum 1.17.1
orjson 3.9.10
packaging 23.0
pandas 2.1.4
peft 0.8.2
Pillow 9.3.0
pip 23.1.2
platformdirs 3.11.0
pluggy 1.0.0
prometheus-client 0.13.1
protobuf 3.20.3
psutil 5.9.7
py-cpuinfo 9.0.0
py-spy 0.3.14
pyarrow 14.0.2
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pycosat 0.6.4
pycparser 2.21
pycryptodome 3.18.0
pydantic 1.10.13
pydantic_core 2.14.5
pynvml 11.5.0
pyOpenSSL 23.0.0
pyrsistent 0.20.0
PySocks 1.7.1
python-dateutil 2.8.2
python-dotenv 1.0.0
pytz 2023.3.post1
PyYAML 6.0.1
quantile-python 1.1
referencing 0.32.0
regex 2023.10.3
requests 2.29.0
rpds-py 0.15.2
rsa 4.9
ruamel.yaml 0.17.21
ruamel.yaml.clib 0.2.6
safetensors 0.4.1
schedule 1.2.1
scipy 1.12.0
sentencepiece 0.1.99
sentry-sdk 1.39.1
setproctitle 1.3.3
setuptools 67.8.0
shellescape 3.8.1
six 1.16.0
smart-open 6.4.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.27.0
sympy 1.12
tabulate 0.9.0
tensorboardX 2.6.2.2
tokenizers 0.15.0
toolz 0.12.0
torch 2.1.1+cu118
torchaudio 2.1.1+cu118
torchmetrics 1.3.1
torchvision 0.16.1+cu118
tqdm 4.65.0
transformers 4.37.1
triton 2.1.0
typing_extensions 4.9.0
tzdata 2023.3
urllib3 1.26.16
uvicorn 0.21.1
uvloop 0.19.0
virtualenv 20.21.0
vllm 0.2.3+cu118
wandb 0.16.1
watchfiles 0.21.0
wcwidth 0.2.12
websockets 12.0
wheel 0.38.4
xformers 0.0.23+cu118
xxhash 3.4.1
yarl 1.9.4
zstandard 0.19.0
Thx for the information!!! Here is my pip list:
vllm 0.3.0+cu123 /workspace/vllm-fork
torch 2.2.0
Can it be related to the vllm version?
@tianhao-nexusflow I don't think it's related to vllm. Is your cuda version 12.3?
Package                       Version     Editable project location
----------------------------- ----------- -------------------------
accelerate                    0.27.2
aiohttp                       3.9.3
aiohttp-cors                  0.7.0
aioprometheus                 23.12.0
aiosignal                     1.3.1
annotated-types               0.6.0
anyio                         4.2.0
appdirs                       1.4.4
async-timeout                 4.0.3
attrs                         23.2.0
bitsandbytes                  0.42.0
blessed                       1.20.0
cachetools                    5.3.2
certifi                       2024.2.2
charset-normalizer            3.3.2
click                         8.1.7
coloredlogs                   15.0.1
colorful                      0.5.6
cupy-cuda12x                  12.1.0
datasets                      2.17.0
deepspeed                     0.13.1
dill                          0.3.8
distlib                       0.3.8
docker-pycreds                0.4.0
einops                        0.7.0
exceptiongroup                1.2.0
fastapi                       0.109.2
fastrlock                     0.8.2
filelock                      3.13.1
flash-attn                    2.5.3
frozenlist                    1.4.1
fsspec                        2023.10.0
gitdb                         4.0.11
GitPython                     3.1.42
google-api-core               2.17.1
google-auth                   2.28.0
googleapis-common-protos      1.62.0
gpustat                       1.1.1
grpcio                        1.60.1
h11                           0.14.0
hjson                         3.1.0
httptools                     0.6.1
huggingface-hub               0.20.3
humanfriendly                 10.0
idna                          3.6
isort                         5.13.2
Jinja2                        3.1.3
jsonlines                     4.0.0
jsonschema                    4.21.1
jsonschema-specifications     2023.12.1
lightning-utilities           0.10.1
loralib                       0.1.2
MarkupSafe                    2.1.5
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.5
multiprocess                  0.70.16
networkx                      3.2.1
ninja                         1.11.1.1
numpy                         1.26.4
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  12.535.133
nvidia-nccl-cu12              2.19.3
nvidia-nvjitlink-cu12         12.3.101
nvidia-nvtx-cu12              12.1.105
opencensus                    0.11.4
opencensus-context            0.1.3
openrlhf                      0.1.9       /workspace/OpenRLHF
optimum                       1.16.2
orjson                        3.9.14
packaging                     23.2
pandas                        2.2.0
peft                          0.8.2
pip                           23.3.1
platformdirs                  4.2.0
prometheus_client             0.20.0
protobuf                      4.25.3
psutil                        5.9.8
py-cpuinfo                    9.0.0
py-spy                        0.3.14
pyarrow                       15.0.0
pyarrow-hotfix                0.6
pyasn1                        0.5.1
pyasn1-modules                0.3.0
pydantic                      2.6.1
pydantic_core                 2.16.2
pynvml                        11.5.0
python-dateutil               2.8.2
python-dotenv                 1.0.1
pytz                          2024.1
PyYAML                        6.0.1
quantile-python               1.1
ray                           2.9.2
referencing                   0.33.0
regex                         2023.12.25
requests                      2.31.0
rpds-py                       0.18.0
rsa                           4.9
safetensors                   0.4.2
scipy                         1.12.0
sentencepiece                 0.1.99
sentry-sdk                    1.40.4
setproctitle                  1.3.3
setuptools                    68.2.2
six                           1.16.0
smart-open                    6.4.0
smmap                         5.0.1
sniffio                       1.3.0
starlette                     0.36.3
sympy                         1.12
tokenizers                    0.15.2
torch                         2.2.0
torchmetrics                  1.3.1
tqdm                          4.66.2
transformers                  4.37.1
transformers-stream-generator 0.0.4
triton                        2.2.0
typing_extensions             4.9.0
tzdata                        2024.1
urllib3                       2.2.0
uvicorn                       0.27.1
uvloop                        0.19.0
virtualenv                    20.25.0
vllm                          0.3.0+cu123 /workspace/vllm-fork
wandb                         0.16.3
watchfiles                    0.21.0
wcwidth                       0.2.13
websockets                    12.0
wheel                         0.41.2
xformers                      0.0.24
xxhash                        3.4.1
yarl                          1.9.4
Given the pip list, I think the CUDA version is 12.1?
@tianhao-nexusflow Can you post your run command and hardware info?
Sure! The run command is:
set -x
export PATH=$HOME/.local/bin/:$PATH
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/workspace/OpenRLHF"}' \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 2 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 2 \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1 \
--pretrain OpenLLMAI/Llama-2-7b-sft-model-ocra-500k \
--reward_pretrain OpenLLMAI/Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt \
--save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 8 \
--rollout_batch_size 1024 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 5e-7 \
--critic_learning_rate 9e-6 \
--init_kl_coef 0.01 \
--prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
--prompt_data_probs 0.4,0.5,0.1 \
--max_samples 80000 \
--normalize_reward \
--actor_init_on_gpu \
--adam_offload \
--flash_attn \
--gradient_checkpointing
For the hardware:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 26C P0 65W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0F:00.0 Off | 0 |
| N/A 25C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:47:00.0 Off | 0 |
| N/A 27C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4E:00.0 Off | 0 |
| N/A 27C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:87:00.0 Off | 0 |
| N/A 30C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:90:00.0 Off | 0 |
| N/A 31C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:B7:00.0 Off | 0 |
| N/A 30C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:BD:00.0 Off | 0 |
| N/A 31C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
@tianhao-nexusflow I can't reproduce with your script either, let me switch to cuda 12 and torch 2.2.
@wuxibin89 I found that there are some compatibility issues between vLLM and the NVIDIA PyTorch 23.12 docker image. It seems that going forward we should provide a dedicated image for users.
@tianhao-nexusflow Do you build vllm from source with torch==2.2? I found that vllm==0.3.0 is built with torch==2.1.2.
This seems to be a problem caused by the container environment. @wuxibin89 can run vLLM fine with his container image, but it hangs with the NVIDIA PyTorch 23.12 image. @karthik19967829
Great, thanks for the inputs, team @hijkzzz. If you can share a docker image that works, that would be great.
If you are able to get it working on top of the HF container or something similar, that should make the setup seamless.
OK! pip install vllm==0.2.4 is all you need!! OpenRLHF will be compatible with vllm==0.3.1 as soon as possible. @karthik19967829
Great, thanks @hijkzzz.
We are testing it out and will ping you once we observe the logs.
@hijkzzz Thanks for your insightful comment! I've followed the updated README; however, when I try
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8,
I get
bash: ray: command not found
However, when I run pip list, everything is installed. I guess it's related to PYTHONPATH but I haven't figured it out yet.
./build_openrlhf.sh ~/.local/bin/ray
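(From the paths above, the ray entry point installed by build_openrlhf.sh appears to land under ~/.local/bin rather than on the default PATH, so adding export PATH=$HOME/.local/bin/:$PATH, as in the run script earlier in this thread, should make the command resolvable. This is inferred from the script paths shown here, not a confirmed fix.)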
@karthik19967829 @tianhao-nexusflow This problem is related to the vllm version: we apply some monkey patches to vllm for weight synchronization between vllm and the actor model.
In vllm==0.2.7, they made a major change to their architecture, which breaks our monkey patch. As a quick fix, you can downgrade vllm to <=0.2.6. I will fix this very soon: https://github.com/vllm-project/vllm/pull/2221
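For readers following along, here is a minimal sketch of what such a monkey patch generally looks like. This is an illustration, not OpenRLHF's actual code; the function name update_weight and the patched attribute path are hypothetical:

import torch
import torch.distributed

def update_weight(self, name, dtype, shape, src_rank=0):
    # Receive one parameter tensor broadcast from the training process group,
    # then copy it into the inference engine's replica of the model.
    # Assumes a torch.distributed process group shared with the trainer.
    weight = torch.empty(shape, dtype=dtype, device="cuda")
    torch.distributed.broadcast(weight, src=src_rank)
    param = dict(self.model.named_parameters())[name]
    param.data.copy_(weight)
    del weight

# The patch replaces a method on a vLLM internal class, e.g. (illustrative only):
# vllm.worker.worker.Worker.update_weight = update_weight

Because the patch reaches into vLLM internals like this, any upstream refactor (such as the one in 0.2.7) can silently break it, which is why pinning the vllm version matters here.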
Great, thanks @wuxibin89! After downgrading, ZeRO-2 + vLLM is running, but the throughput is not very different from vanilla; we need to benchmark more closely. What's the general practice for speeding it up?
@karthik19967829 For PPO RLHF, sequence generation is the major bottleneck; increasing vllm_num_engines can reduce the generation time. For 8 GPUs and a 7B model, you can try (ref=1, reward=1, actor=2, critic=2, vllm=2).
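Concretely, under the assumption that each vLLM engine consumes vllm_tensor_parallel_size GPUs, that split maps onto the train_ppo_ray.py flags used throughout this thread as --ref_num_gpus_per_node 1 --reward_num_gpus_per_node 1 --actor_num_gpus_per_node 2 --critic_num_gpus_per_node 2 --vllm_num_engines 2 --vllm_tensor_parallel_size 1, which accounts for all 8 GPUs (1 + 1 + 2 + 2 + 2 = 8).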
@karthik19967829 Our distributed design is for models above 13B, so please set up more GPUs for vllm. For 7B models, you could try OpenRLHF without Ray.
@karthik19967829 @tianhao-nexusflow Fixed in https://github.com/OpenLLMAI/OpenRLHF/pull/215, I have tested vllm==0.2.3 and vllm==0.3.1 with vllm_tensor_parallel_size=1/2, all tests have passed.
@wuxibin89 @hijkzzz Thx! It works now, appreciate the help!!!
@wuxibin89 @hijkzzz thank you so much for the quick help! best open-source team I have met!
vllm==0.4.1 also hangs @wuxibin89
Team, thank you so much for this wonderful toolkit! We are trying to test the vllm setting with the mistralai/Mistral-7B-Instruct-v0.2 model with ZeRO-2:
ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
-- python3 examples/train_ppo_ray.py \
--ref_num_nodes 1 \
--ref_num_gpus_per_node 1 \
--reward_num_nodes 1 \
--reward_num_gpus_per_node 1 \
--critic_num_nodes 1 \
--critic_num_gpus_per_node 1 \
--actor_num_nodes 1 \
--actor_num_gpus_per_node 4 \
--pretrain openchat/openchat_3.5 \
--reward_pretrain openchat/openchat_3.5 \
--critic_pretrain openchat/openchat_3.5 \
--save_path /openrlhf/examples/scripts/ckpt/starling_7b \
--micro_train_batch_size 4 \
--train_batch_size 128 \
--micro_rollout_batch_size 16 \
--rollout_batch_size 256 \
--max_epochs 1 \
--prompt_max_len 1024 \
--generate_max_len 1024 \
--zero_stage 2 \
--bf16 \
--actor_learning_rate 2e-7 \
--critic_learning_rate 3e-6 \
--init_kl_coef 0.001 \
--prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf \
--prompt_data_probs 1 \
--max_samples 256 \
--actor_init_on_gpu \
--adam_offload \
--gradient_checkpointing \
--vllm_num_engines 1 \
--vllm_tensor_parallel_size 1