OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

The configuration for Llama-7b on 4 RTX4090 #269

Open LinkyLiu opened 2 months ago

LinkyLiu commented 2 months ago

Hello, I want to run train_ppo_llama_ray.sh on 4 RTX 4090s. Should I modify actor_num_gpus_per_node/critic_num_gpus_per_node in train_ppo_llama_ray.sh? Since the default script targets 8 GPUs, what else should I pay attention to or modify?

hijkzzz commented 2 months ago

Use actor / critic / reward model / initial (reference) model = 1, 1, 1, 1 GPUs, with Adam offload + gradient checkpointing.
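
To make that concrete, here is a rough sketch of how the advice maps onto the launch flags (flag names taken from the full example command later in this thread; the remaining arguments stay as in train_ppo_llama_ray.sh):

```shell
# Sketch only: one RTX 4090 per model (actor, critic, reward, reference)
# plus the memory-saving options suggested above. Flag names follow the
# full example command posted below; other arguments (pretrain paths,
# batch sizes, etc.) are left as in train_ppo_llama_ray.sh.
python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 --critic_num_gpus_per_node 1 \
    --actor_num_nodes 1 --actor_num_gpus_per_node 1 \
    --adam_offload \
    --gradient_checkpointing
    # ... plus the usual --pretrain / --reward_pretrain / batch-size arguments
```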

LinkyLiu commented 2 months ago

@hijkzzz Thank you for replying! I ran into the problem below, though. Do you know how to solve it?

```
Status message: Job entrypoint command failed with exit code 1, last available logs (truncated to 20,000 chars):
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/bin/python3.10/dist-packages/ray/_private/worker.py", line 866, in get_objects
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: ActorModelRayActor
        actor_id: 53688e714f4881c3b3028ed402000000
        pid: 3752
        namespace: f4c18cbd-bbfb-4d8b-acf3-3aa591111fe9
        ip: 0.0.0.0
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```

hijkzzz commented 2 months ago

Do you have more detailed logs + running envs + launch commands?

libowen424 commented 2 months ago

I succeeded with the following configuration:

```shell
set -x
export PATH=$HOME/.local/bin/:$PATH

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    -- python3 examples/train_ppo_ray.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 1 \
    --reward_num_nodes 1 \
    --reward_num_gpus_per_node 1 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 1 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 1 \
    --pretrain /root/.cache/huggingface/hub/llama-2-7b-chat-hf \
    --reward_pretrain /root/.cache/huggingface/hub/models--OpenLLMAI--Llama-2-7b-rm-anthropic_hh-lmsys-oasst-webgpt/snapshots/a982afeed00fac9767d53aecde5b88947b1be194 \
    --save_path /openrlhf/examples/test_scripts/ckpt/7b_llama \
    --micro_train_batch_size 2 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 4 \
    --rollout_batch_size 1024 \
    --max_epochs 1 \
    --prompt_max_len 1024 \
    --generate_max_len 1024 \
    --zero_stage 2 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data Open-Orca/OpenOrca,Dahoas/full-hh-rlhf,tasksource/oasst1_pairwise_rlhf_reward \
    --prompt_data_probs 0.4,0.5,0.1 \
    --max_samples 80000 \
    --normalize_reward \
    --actor_init_on_gpu \
    --adam_offload \
    --flash_attn \
    --gradient_checkpointing \
    --lora_rank 4
```

wuxibin89 commented 2 months ago

@LinkyLiu The Ray actor died unexpectedly. Please check the Ray logs in /tmp/ray/session_latest/logs/: raylet.out, raylet.err, job-xxx.log. There should be more information there about why the actor died.
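
For example, something along these lines on the node that submitted the job (a sketch; the exact `job-xxx.log` file name will differ on your machine):

```shell
# List the logs for the most recent Ray session
ls /tmp/ray/session_latest/logs/

# Raylet-side errors (e.g. a worker killed by the OOM killer shows up here)
tail -n 100 /tmp/ray/session_latest/logs/raylet.err
tail -n 100 /tmp/ray/session_latest/logs/raylet.out

# Driver/job log -- substitute your own job id for the wildcard
grep -inE "error|killed|oom" /tmp/ray/session_latest/logs/job-*.log
```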