LinkyLiu opened 2 months ago
actor, critic, rm, init nodes = 1,1,1,1 with adam_offload + gradient_checkpointing
@hijkzzz Thank you for replying! But I ran into this problem; do you know how to solve it?
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):

```
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/bin/python3.10/dist-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: ActorModelRayActor
	actor_id: 53688e714f4881c3b3028ed402000000
	pid: 3752
	namespace: f4c18cbd-bbfb-4d8b-acf3-3aa591111fe9
	ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
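Of the three causes Ray lists, (1) the kernel OOM killer is worth ruling out first on a single node. A minimal check on the host (a sketch; it assumes a Linux box with a readable kernel log, and is not an OpenRLHF command):

```shell
# Did the kernel OOM killer SIGKILL the worker? OOM kills are recorded
# in the kernel ring buffer (readable via dmesg or the system journal).
dmesg 2>/dev/null | grep -iE "out of memory|killed process" | tail -n 5

# Fallback when dmesg is restricted inside the container:
journalctl -k --no-pager 2>/dev/null | grep -i "oom" | tail -n 5
```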
Do you have more detailed logs + running envs + launch commands?
I succeeded with the following configuration:
```shell
set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 1 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 1 \
    --pretrain /root/.cache/huggingface/hub/llama-2-7b-chat-hf \
    --reward_pretrain /root/.cache/huggingface/hub/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194 \
    --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
    --micro_train_batch_size 2 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 4 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
    --prompt_data_probs 0.4,0.5,0.1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing \
    --lora_rank 4
```
@LinkyLiu The Ray actor died unexpectedly; please check the Ray logs in /tmp/ray/session_latest/logs/: raylet.out, raylet.err, job-xxx.log. There should be more information there about why the actor died.
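A quick way to scan those files (a sketch; the grep keywords are my own guesses at relevant patterns, and the path assumes Ray's default temp directory):

```shell
# Scan Ray's session logs for fatal signals or memory errors.
# session_latest is a symlink Ray keeps pointing at the newest session.
LOG_DIR=${RAY_LOG_DIR:-/tmp/ray/session_latest/logs}

grep -riE "sigsegv|out of memory|oom|worker.*died" "$LOG_DIR" 2>/dev/null | tail -n 20
```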
Hello, I want to run train_ppo_llama_ray.sh on 4 RTX 4090s. Should I modify actor_num_gpus_per_node/critic_num_gpus_per_node in train_ppo_llama_ray.sh? Since the default script targets 8 GPUs, what else should I pay attention to or modify?
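Not an answer from the maintainers, but for reference: the working command above already assigns one GPU per model across the four roles (ref, reward, critic, actor), which adds up to exactly four GPUs on one node. A sketch of the relevant flags under that assumption:

```shell
# Assumed 4-GPU split: one GPU each for ref, reward, critic, and actor.
--ref_num_nodes 1 --ref_num_gpus_per_node 1 \
--reward_num_nodes 1 --reward_num_gpus_per_node 1 \
--critic_num_nodes 1 --critic_num_gpus_per_node 1 \
--actor_num_nodes 1 --actor_num_gpus_per_node 1 \
```

Note that a 4090's 24 GB is tight for 7B-scale models, so the memory-saving flags already present in the command (`--adam_offload`, `--gradient_checkpointing`, `--zero_stage 2`, small micro batch sizes) likely still matter.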