OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Strange Kill of Critic Model #305

Ricardokevins opened this issue 1 month ago

Ricardokevins commented 1 month ago

I'm training with PPO + Ray. Training runs normally for a number of steps, and then the following error appears:

File "/tmp/ray/session_2024-05-24_17-35-31_318483_337945/runtime_resources/working_dir_files/_ray_pkg_d887115d5fd5f465/openrlhf/trainer/ray/ppo_actor.py", line 115, in ppo_train
    status.update(ray.get(critic_status_ref))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: CriticModelRayActor
        actor_id: 66075e4f50cc9155208f189803000000
        pid: 375881
        namespace: f0753efe-aacc-438f-969d-afdc956dd354
        ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Tracing back, I found this in the current training log:

A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff66075e4f50cc9155208f189803000000 Worker ID: 50b0e0f384490a6852903bbdc1cef21065c230c6000a8077d28026a6 Node ID: 6346a64af9cca6a4ba3997b438ce356ae591f0047d7da61374102e4b Worker IP address: 0.0.0.0 Worker port: 10148 Worker PID: 375881 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
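
A quick way to test root cause (1), the OOM killer, is to check the kernel log on that node right after the crash. A minimal sketch, assuming the default Ray temp directory and using the PID/Worker ID from the notice above:

# Did the Linux OOM killer SIGKILL the worker? Check the kernel ring buffer.
dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# Ray keeps per-worker logs under the session directory; locate the dead
# worker's files via the Worker ID from the death notice.
ls /tmp/ray/session_latest/logs/ | grep 50b0e0f3
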
Ricardokevins commented 1 month ago

The corresponding err file:

:job_id:03000000 [2024-05-24 18:57:09,526] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) :actor_name:CriticModelRayActor [2024-05-24 18:58:07,087] [INFO] [comm.py:637:init_distributed] cdb=None [2024-05-24 18:58:07,088] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl LLMForSequenceRegression( (model): LlamaModel( (embed_tokens): Embedding(128256, 4096) (layers): ModuleList( (0-31): 32 x LlamaDecoderLayer( (self_attn): LlamaFlashAttention2( (q_proj): Linear(in_features=4096, out_features=4096, bias=False) (k_proj): Linear(in_features=4096, out_features=1024, bias=False) (v_proj): Linear(in_features=4096, out_features=1024, bias=False) (o_proj): Linear(in_features=4096, out_features=4096, bias=False) (rotary_emb): LlamaRotaryEmbedding() ) (mlp): LlamaMLP( (gate_proj): Linear(in_features=4096, out_features=14336, bias=False) (up_proj): Linear(in_features=4096, out_features=14336, bias=False) (down_proj): Linear(in_features=14336, out_features=4096, bias=False) (act_fn): SiLU() ) (input_layernorm): LlamaRMSNorm() (post_attention_layernorm): LlamaRMSNorm() ) ) (norm): LlamaRMSNorm() ) (value_head): Linear(in_features=4096, out_features=1, bias=False) ) reward normalization status: True mean: tensor([0.], dtype=torch.bfloat16), std tensor([1.], dtype=torch.bfloat16) Time to load cpu_adam op: 2.4042906761169434 seconds [2024-05-24 18:58:16,843] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.5, git-hash=unknown, git-branch=unknown [2024-05-24 18:58:16,843] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000009, betas=(0.900000, 0.950000), weight_decay=0.000000, adam_w=1 n136-112-040:375881:375881 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 n136-112-040:375881:375881 [0] NCCL INFO Bootstrap : Using eth0:10.136.112.40<0> n136-112-040:375881:375881 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation n136-112-040:375881:375881 [0] NCCL INFO cudaDriverVersion 12010 NCCL version 2.20.5+cuda12.4 n136-112-040:375881:377674 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0. n136-112-040:375881:377674 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 n136-112-040:375881:377674 [0] NCCL INFO NCCL_IB_HCA set to mlx5 n136-112-040:375881:377674 [0] NCCL INFO NET/IB : No device found. 
n136-112-040:375881:377674 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0 n136-112-040:375881:377674 [0] NCCL INFO NET/Socket : Using [0]eth0:10.136.112.40<0> n136-112-040:375881:377674 [0] NCCL INFO Using non-device net plugin version 0 n136-112-040:375881:377674 [0] NCCL INFO Using network Socket n136-112-040:375881:377674 [0] NCCL INFO comm 0x1bbd32a0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 4a000 commId 0xb015bed65ddba90d - Init START n136-112-040:375881:377674 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff n136-112-040:375881:377674 [0] NCCL INFO comm 0x1bbd32a0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-112-040:375881:377674 [0] NCCL INFO Channel 00/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 01/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 02/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 03/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 04/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 05/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 06/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 07/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 08/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 09/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 10/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 11/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 12/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 13/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 14/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 15/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 16/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 17/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 18/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 19/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 20/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 21/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 22/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Channel 23/24 : 0 1 n136-112-040:375881:377674 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-112-040:375881:377674 [0] NCCL INFO P2P Chunksize set to 524288 n136-112-040:375881:377674 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO 
Channel 09/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2[2024-05-24 18:58:21,492] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2024-05-24 18:58:21,493] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2024-05-24 18:58:21,493] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-05-24 18:58:21,504] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam [2024-05-24 18:58:21,504] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> [2024-05-24 18:58:21,504] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer [2024-05-24 18:58:21,504] [INFO] [stage_1_and_2.py:149:init] Reduce bucket size 500,000,000 [2024-05-24 18:58:21,504] [INFO] [stage_1_and_2.py:150:init] Allgather bucket size 500,000,000 [2024-05-24 18:58:21,504] [INFO] [stage_1_and_2.py:151:init] CPU Offload: True [2024-05-24 18:58:21,504] [INFO] [stage_1_and_2.py:152:init] Round robin gradient partitioning: False [2024-05-24 18:58:40,366] [INFO] [utils.py:800:see_memory_usage] Before initializing optimizer states [2024-05-24 18:58:40,367] [INFO] [utils.py:801:see_memory_usage] MA 15.08 GB Max_MA 15.08 GB CA 15.59 GB Max_CA 16 GB [2024-05-24 18:58:40,367] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 222.94 GB, percent = 11.1% [2024-05-24 18:58:47,530] [INFO] [utils.py:800:see_memory_usage] After initializing optimizer states [2024-05-24 18:58:47,530] [INFO] [utils.py:801:see_memory_usage] MA 15.08 GB Max_MA 15.08 GB CA 15.59 GB Max_CA 16 GB [2024-05-24 18:58:47,530] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 245.8 GB, percent = 12.2% [2024-05-24 18:58:47,530] [INFO] [stage_1_and_2.py:539:init] optimizer state initialized [2024-05-24 18:58:47,639] [INFO] [utils.py:800:see_memory_usage] After initializing ZeRO optimizer [2024-05-24 18:58:47,639] [INFO] [utils.py:801:see_memory_usage] MA 15.08 GB Max_MA 15.08 GB CA 15.59 GB Max_CA 16 GB [2024-05-24 18:58:47,639] [INFO] [utils.py:808:see_memory_usage] CPU Virtual Memory: used = 245.78 GB, percent = 12.2% [2024-05-24 18:58:47,643] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam [2024-05-24 18:58:47,643] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-05-24 18:58:47,643] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7fa8334f8ed0> [2024-05-24 18:58:47,643] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.95), (0.9, 0.95)] [2024-05-24 18:58:47,644] [INFO] [config.py:996:print] DeepSpeedEngine configuration: [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] amp_enabled .................. 
False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] amp_params ................... False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] bfloat16_enabled ............. True [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] bfloat16_immediate_grad_update False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] checkpoint_parallel_write_pipeline False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] checkpoint_tag_validation_enabled True [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] checkpoint_tag_validation_fail False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fa832d61a50> [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] communication_data_type ...... None [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] compile_config ............... enabled=False backend='inductor' kwargs={} [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] curriculum_enabled_legacy .... False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] curriculum_params_legacy ..... False [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-05-24 18:58:47,644] [INFO] [config.py:1000:print] data_efficiency_enabled ...... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] dataloader_drop_last ......... 
False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] disable_allgather ............ False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] dump_state ................... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] dynamic_loss_scale_args ...... None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_enabled ........... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_gas_boundary_resolution 1 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_layer_num ......... 0 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_max_iter .......... 100 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_stability ......... 1e-06 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_tol ............... 0.01 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] eigenvalue_verbose ........... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] elasticity_enabled ........... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] fp16_auto_cast ............... None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] fp16_enabled ................. False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] fp16_master_weights_and_gradients False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] global_rank .................. 0 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] grad_accum_dtype ............. bf16 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] gradient_accumulation_steps .. 16 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] gradient_clipping ............ 1.0 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] gradient_predivide_factor .... 1.0 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] graph_harvesting ............. False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] initial_dynamic_scale ........ 1 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] load_universal_checkpoint .... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] loss_scale ................... 1.0 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] memory_breakdown ............. False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] mics_hierarchial_params_gather False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] mics_shard_size .............. -1 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] nebula_config ................ 
{ "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] optimizer_legacy_fusion ...... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] optimizer_name ............... None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] optimizer_params ............. None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] pld_enabled .................. False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] pld_params ................... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] prescale_gradients ........... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] scheduler_name ............... None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] scheduler_params ............. None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] seq_parallel_communication_data_type torch.float32 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] sparse_attention ............. None [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] sparse_gradients_enabled ..... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] steps_per_print .............. 100 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] train_batch_size ............. 64 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] train_micro_batch_size_per_gpu 2 [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] use_data_before_expertparallel False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] use_node_local_storage ....... False [2024-05-24 18:58:47,645] [INFO] [config.py:1000:print] wall_clock_breakdown ......... False [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] weight_quantization_config ... None [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] world_size ................... 2 [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] zero_allow_untested_optimizer False [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] zero_config .................. 
stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] zero_enabled ................. True [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] zero_force_ds_cpu_optimizer .. True [2024-05-24 18:58:47,646] [INFO] [config.py:1000:print] zero_optimization_stage ...... 2 [2024-05-24 18:58:47,646] [INFO] [config.py:986:print_user_config] json = { "steps_per_print": 100, "zero_optimization": { "stage": 2, "offload_param": { "device": "none" }, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "sub_group_size": "auto", "stage3_max_live_parameters": "auto", "stage3_max_reuse_distance": "auto", "stage3_param_persistence_threshold": "auto", "stage3_prefetch_bucket_size": "auto", "reduce_bucket_size": "auto", "zero_hpz_partition_size": 1, "zero_quantized_weights": false, "zero_quantized_gradients": false }, "bf16": { "enabled": true }, "gradient_clipping": 1.0, "prescale_gradients": false, "wall_clock_breakdown": false, "data_types": { "grad_accum_dtype": "bf16" }, "train_micro_batch_size_per_gpu": 2, "train_batch_size": 64 } Generates critic values. 
===== I am alive P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 16/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 17/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 18/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 19/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 20/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 21/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 22/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Channel 23/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:377674 [0] NCCL INFO Connected all rings n136-112-040:375881:377674 [0] NCCL INFO Connected all trees n136-112-040:375881:377674 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-112-040:375881:377674 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-112-040:375881:377674 [0] NCCL INFO comm 0x1bbd32a0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 4a000 commId 0xb015bed65ddba90d - Init COMPLETE Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. ===== I am alive Generates critic values. 
===== I am alive I am progressing 0 !~ n136-112-040:375881:382814 [0] NCCL INFO Using non-device net plugin version 0 n136-112-040:375881:382814 [0] NCCL INFO Using network Socket n136-112-040:375881:382814 [0] NCCL INFO comm 0x26252b80 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 4a000 commId 0xe84413f7a9d26087 - Init START n136-112-040:375881:382814 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff n136-112-040:375881:382814 [0] NCCL INFO comm 0x26252b80 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0 n136-112-040:375881:382814 [0] NCCL INFO Channel 00/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 01/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 02/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 03/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 04/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 05/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 06/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 07/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 08/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 09/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 10/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 11/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 12/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 13/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 14/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 15/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 16/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 17/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 18/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 19/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 20/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 21/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 22/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Channel 23/24 : 0 1 n136-112-040:375881:382814 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1 n136-112-040:375881:382814 [0] NCCL INFO P2P Chunksize set to 524288 n136-112-040:375881:382814 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 09/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2P/CUMEM/read 
n136-112-040:375881:382814 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 16/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 17/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 18/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 19/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 20/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 21/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 22/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Channel 23/0 : 0[2] -> 1[3] via P2P/CUMEM/read n136-112-040:375881:382814 [0] NCCL INFO Connected all rings n136-112-040:375881:382814 [0] NCCL INFO Connected all trees n136-112-040:375881:382814 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512 n136-112-040:375881:382814 [0] NCCL INFO 24 coll channels, 0 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer n136-112-040:375881:382814 [0] NCCL INFO comm 0x26252b80 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 4a000 commId 0xe84413f7a9d26087 - Init COMPLETE I am progressing 1 !~ I am progressing 2 !~ I am progressing 3 !~ I am progressing 4 !~ I am progressing 5 !~ I am progressing 6 !~ I am progressing 7 !~ I am progressing 8 !~ I am progressing 9 !~ I am progressing 10 !~ I am progressing 11 !~ I am progressing 12 !~ I am progressing 13 !~ I am progressing 14 !~ I am progressing 15 !~

And this one:

:job_id:03000000 /home/tiger/.local/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( :actor_name:CriticModelRayActor The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').

Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s] Loading checkpoint shards: 25%|██▌ | 1/4 [00:00<00:01, 1.54it/s] Loading checkpoint shards: 50%|█████ | 2/4 [00:01<00:01, 1.34it/s] Loading checkpoint shards: 75%|███████▌ | 3/4 [00:02<00:00, 1.21it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.69it/s] Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00, 1.53it/s] Some weights of LLMForSequenceRegression were not initialized from the model checkpoint at /mnt/bn/shesjlq20t/HDFS/Trained/Llama3-8b-chat-rm-v14 and are newly initialized: ['value_head.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. Using /home/tiger/.cache/torch_extensions/py311_cu121 as PyTorch extensions root... Loading extension module cpu_adam... libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.

Train epoch [1/1]: 0%| | 0/64 [00:00<?, ?it/s] Train epoch [1/1]: 0%| | 0/64 [00:10<?, ?it/s, critic_loss=0.0367, values=0.42] Train epoch [1/1]: 2%|▏ | 1/64 [00:10<10:37, 10.12s/it, critic_loss=0.0367, values=0.42] Train epoch [1/1]: 2%|▏ | 1/64 [00:12<10:37, 10.12s/it, critic_loss=0.0226, values=0.271] Train epoch [1/1]: 3%|▎ | 2/64 [00:12<05:28, 5.30s/it, critic_loss=0.0226, values=0.271] Train epoch [1/1]: 3%|▎ | 2/64 [00:14<05:28, 5.30s/it, critic_loss=0.0202, values=0.19] Train epoch [1/1]: 5%|▍ | 3/64 [00:14<04:02, 3.98s/it, critic_loss=0.0202, values=0.19] Train epoch [1/1]: 5%|▍ | 3/64 [00:16<04:02, 3.98s/it, critic_loss=0.0462, values=0.646] Train epoch [1/1]: 6%|▋ | 4/64 [00:16<03:10, 3.17s/it, critic_loss=0.0462, values=0.646] Train epoch [1/1]: 6%|▋ | 4/64 [00:18<03:10, 3.17s/it, critic_loss=0.0225, values=0.54] Train epoch [1/1]: 8%|▊ | 5/64 [00:18<02:42, 2.75s/it, critic_loss=0.0225, values=0.54] Train epoch [1/1]: 8%|▊ | 5/64 [00:20<02:42, 2.75s/it, critic_loss=0.021, values=0.117] Train epoch [1/1]: 9%|▉ | 6/64 [00:20<02:20, 2.42s/it, critic_loss=0.021, values=0.117] Train epoch [1/1]: 9%|▉ | 6/64 [00:21<02:20, 2.42s/it, critic_loss=0.0237, values=0.508] Train epoch [1/1]: 11%|█ | 7/64 [00:21<02:05, 2.21s/it, critic_loss=0.0237, values=0.508] Train epoch [1/1]: 11%|█ | 7/64 [00:23<02:05, 2.21s/it, critic_loss=0.0565, values=0.0332] Train epoch [1/1]: 12%|█▎ | 8/64 [00:23<01:57, 2.09s/it, critic_loss=0.0565, values=0.0332] Train epoch [1/1]: 12%|█▎ | 8/64 [00:25<01:57, 2.09s/it, critic_loss=0.0381, values=0.412] Train epoch [1/1]: 14%|█▍ | 9/64 [00:25<01:48, 1.98s/it, critic_loss=0.0381, values=0.412] Train epoch [1/1]: 14%|█▍ | 9/64 [00:27<01:48, 1.98s/it, critic_loss=0.0519, values=0.582] Train epoch [1/1]: 16%|█▌ | 10/64 [00:27<01:45, 1.96s/it, critic_loss=0.0519, values=0.582] Train epoch [1/1]: 16%|█▌ | 10/64 [00:29<01:45, 1.96s/it, critic_loss=0.0439, values=0.0381] Train epoch [1/1]: 17%|█▋ | 11/64 [00:29<01:44, 1.98s/it, critic_loss=0.0439, values=0.0381] Train epoch [1/1]: 17%|█▋ | 11/64 [00:31<01:44, 1.98s/it, critic_loss=0.0211, values=0.313] Train epoch [1/1]: 19%|█▉ | 12/64 [00:31<01:40, 1.93s/it, critic_loss=0.0211, values=0.313] Train epoch [1/1]: 19%|█▉ | 12/64 [00:32<01:40, 1.93s/it, critic_loss=0.041, values=0.951] Train epoch [1/1]: 20%|██ | 13/64 [00:32<01:32, 1.82s/it, critic_loss=0.041, values=0.951] Train epoch [1/1]: 20%|██ | 13/64 [00:34<01:32, 1.82s/it, critic_loss=0.0165, values=-0.578] Train epoch [1/1]: 22%|██▏ | 14/64 [00:34<01:34, 1.89s/it, critic_loss=0.0165, values=-0.578] Train epoch [1/1]: 22%|██▏ | 14/64 [00:36<01:34, 1.89s/it, critic_loss=0.0243, values=0.105] Train epoch [1/1]: 23%|██▎ | 15/64 [00:36<01:31, 1.87s/it, critic_loss=0.0243, values=0.105]

Ricardokevins commented 1 month ago

"Generates critic values. ===== I am alive" and "I am progressing 15 !~" are debug statements I added, hoping to see where the problem is (but that didn't work).
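
To correlate those heartbeats with the kill, one option is to pull the last lines the critic worker printed from the Ray session logs. A sketch, assuming the default log location (the worker log file naming can differ across Ray versions):

# Show the final heartbeats the critic printed before its process died;
# the Worker ID prefix comes from the death notice above.
grep -h "I am alive\|I am progressing" /tmp/ray/session_latest/logs/worker-50b0e0f3*.err | tail -n 5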

Ricardokevins commented 1 month ago

The machine I'm using has quite a lot of RAM, about 2 TB.

Here is the launch script:

set -x 

# ray start --head --node-ip-address 0.0.0.0 --num-gpus 8
# if you want to launch ray on more nodes, use
# ray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8
# ray stop
# ps aux | grep '/usr/bin/python3' | grep -v grep | awk '{print $2}' | xargs kill
WANDB_PROJECT=${project} WANDB_NAME=${expr} ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "xxxxx/OpenRLHF-main", "pip": "xxxxxxxOpenRLHF-main/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 2 \
    --vllm_num_engines 2 \
    --vllm_tensor_parallel_size 1 \
    --use_wandb HelloWorldHelloWorldHelloWorldHelloWorld \
    --wandb_project ${project} \
    --wandb_run_name ${expr} \
    --save_path ./7b_llama \
    --micro_train_batch_size 2 \
    --train_batch_size 64 \
    --micro_rollout_batch_size 4 \
    --rollout_batch_size 256 \
    --max_epochs 1 \
    --grad_accum_dtype bf16 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --max_samples 10000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing

The main issue is that the critic model is killed without any error message, which feels very strange; I don't have any leads.
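
Since the kill arrives with no Python traceback, it may be worth watching host memory while training runs, to see whether RSS climbs until the critic process disappears. A minimal sketch (375881 is the PID from the logs above; substitute the live critic worker's PID):

# Sample system and per-process memory every 30 s until the worker exits.
while ps -p 375881 > /dev/null; do
    date
    free -g | head -n 2
    ps -o pid,rss,vsz,cmd -p 375881
    sleep 30
done >> critic_mem.log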

Ricardokevins commented 1 month ago

Training with LoRA seems to avoid the error for now. But I'd still like to be able to do full-parameter training.

  --lora_rank 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \

I've noticed an odd phenomenon. Could this be a GPU memory allocation problem when the critic model shares its parameters across two cards, e.g. GPUs 4 and 5?

[screenshot: per-GPU memory usage]
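
To see whether the two critic cards drift apart before the crash, nvidia-smi can sample per-GPU memory in a loop. A minimal sketch:

# Log per-GPU memory every 10 s into a CSV for later comparison,
# e.g. to compare GPUs 4 and 5 over the steps leading up to the kill.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total \
    --format=csv -l 10 >> gpu_mem.csv
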
hijkzzz commented 1 month ago
  1. Check whether you are using the Docker image we provide (https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile) or something similarly compatible.
  2. You can try the following:

    git pull (we upgraded Ray, which reduces memory usage)

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 2 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 2 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 2 \
    --vllm_num_engines 2 \
    --vllm_tensor_parallel_size 2 \
    --colocate_critic_reward \
    --colocate_actor_ref \
    --ref_reward_offload \
    --pretrain meta-llama/Meta-Llama-3-8B-Instruct \
    --reward_pretrain meta-llama/Meta-Llama-3-8B-Instruct \
    --save_path /openrlhf/examples/test_scripts/ckpt/llama_ray \
    --micro_train_batch_size 4 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 16 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 3 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca \
    --prompt_data_probs 1.0 \
    --max_samples 50000 \
    --normalize_reward \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing