OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

support remote rm and ref model api for ppo #341

Open · catqaq opened 1 week ago

catqaq commented 1 week ago

Simplify RLHF! Use a remote Reward Model (RM) and Reference model, freeing the training job from hosting them locally so it can focus solely on training the actor and critic models. This significantly reduces GPU memory requirements and further unleashes the potential of OpenRLHF!

Main modification points:

Note: this is a simple implementation I put together quickly tonight. If there are no issues and it doesn't affect other modules, it can be merged first; further optimization can be done later when time permits, as I may not have much time available in the near term.
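
For illustration of the remote RM idea above, here is a minimal sketch of what querying a remote reward model over HTTP could look like. The endpoint path, payload shape, and the rewards field are assumptions for this example, not necessarily the interface implemented in this PR.

import requests  # plain HTTP client, used only for illustration here

def remote_reward(queries, url="http://<rm-host>:5000/get_reward"):
    # Hypothetical endpoint: post the decoded prompt+response texts and
    # receive one scalar reward per sequence in return.
    resp = requests.post(url, json={"query": queries}, timeout=60)
    resp.raise_for_status()
    return resp.json()["rewards"]  # assumed response field

# rewards = remote_reward(["<prompt + response text>", ...])

The PPO trainer would then use these scalars in place of the scores from a locally loaded RM.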

catqaq commented 1 week ago

The tests have just passed on version 0.2.6; I am still testing on the latest version.

hijkzzz commented 1 week ago

I can understand the significance of a remote RM (such as for the Nemotron 340B RM), but it seems that a remote Reference Model doesn't make much sense (Ray PPO already supports deploying the Ref model on other nodes). Additionally, there could be issues with implementing the remote Reference Model: if the tokenizers and base models on the two sides are not aligned, it could cause errors in the calculation of the advantages/loss function.

catqaq commented 1 week ago

I can understand the significance of remote RM (such as for Nemotron 340B RM), but it seems that a remote Reference Model doesn't make much sense. Additionally, there could be issues with implementing the Reference Model, because if the tokenizers and base models on both sides are not aligned, it could cause errors with the calculation of the advantages/loss function.

Yes, a remote reference model needs to be used with caution; it typically requires the same tokenizer as the actor model, and during testing I used the same model on both sides. There are actually some interesting things that could be done here, but they are not a high priority right now, so for the time being I have made it an optional feature: direct use does carry some risk from tokenizer mismatches and encode/decode consistency.
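
As a basic sanity check before enabling a remote reference model, one could verify that the actor and reference checkpoints share a compatible tokenizer. A minimal sketch, assuming both are Hugging Face checkpoints (the paths and the round-trip check are illustrative only):

from transformers import AutoTokenizer

# Hypothetical checkpoint paths; in practice both sides would usually be
# launched from the same model name/revision.
actor_tok = AutoTokenizer.from_pretrained("path/to/actor")
ref_tok = AutoTokenizer.from_pretrained("path/to/reference")

sample = "Hello, OpenRLHF!"
assert actor_tok.get_vocab() == ref_tok.get_vocab(), "vocabularies differ"
assert actor_tok.encode(sample) == ref_tok.encode(sample), "tokenization differs"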

hijkzzz commented 1 week ago

Using Ray to place the reference model on low-compute machines:

In Ray, you can control task scheduling by specifying the node's IP address. Ray allows you to specify resource constraints when submitting tasks, and you can use custom resource labels to help the scheduler select the appropriate node. Here is a basic example demonstrating how to use Ray's resource labels and IP addresses to specify nodes:

Start Ray on the nodes:

When starting Ray on each node, you can specify custom resource labels. For example:

On machine A (with small memory):

ray start --node-ip-address=<IP of machine A> --resources '{"small_memory": 1}'

On machine B (with large memory):

ray start --node-ip-address=<IP of machine B> --resources '{"large_memory": 1}'

Specify resource requirements in your script:

When submitting tasks, you can specify the resources required for the task. For example:

import ray

ray.init(address='auto')  # Connect to the Ray cluster

@ray.remote(resources={"small_memory": 1})
def task1():
    # Task suitable for small memory
    pass

@ray.remote(resources={"large_memory": 1})
def task2():
    # Task suitable for large memory
    pass

# Submit tasks
result1 = task1.remote()
result2 = task2.remote()

In this example, task1 will be scheduled on a node with the small_memory resource (i.e., machine A), and task2 will be scheduled on a node with the large_memory resource (i.e., machine B).

This method allows you to achieve node scheduling through resource labels without directly using IP addresses. If you really need to specify the node IP directly, you could consider combining Ray's Node API, but this approach is generally not recommended as it breaks the abstraction and flexibility of task scheduling.
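
For completeness, pinning a task to a specific node can be done with Ray's NodeAffinitySchedulingStrategy; a minimal sketch (the node is picked arbitrarily here, and as noted above, resource labels are usually the better approach):

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address='auto')

# Pick a node ID from the cluster; in practice you would match on the node's
# IP address or resources reported by ray.nodes().
node_id = ray.nodes()[0]["NodeID"]

@ray.remote
def pinned_task():
    return "running on the pinned node"

# soft=False means the task stays pending instead of running elsewhere
# if that node is unavailable.
result = ray.get(
    pinned_task.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
    ).remote()
)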

See OpenRLHF tutorial: https://openrlhf.readthedocs.io/en/latest/faq.html#how-to-specifies-the-nodes-allocation-for-the-models-in-ray-ppo

zaemyung commented 5 days ago

@hijkzzz Could you provide more example codes for using Ray for the cases where a reward model is served by one GPU on the same single node? More specifically, where should we modify to support API-based calls?

hijkzzz commented 5 days ago

@hijkzzz Could you provide more example codes for using Ray for the cases where a reward model is served by one GPU on the same single node? More specifically, where should we modify to support API-based calls?

I'm not sure I get what you mean; Ray PPO already supports serving the RM on one GPU on the same node. See https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_ppo_llama_ray.sh

Btw, if we want to deploy the Reference Model on a V100 x8 node (and the other models on A100s, etc.), we can modify the resources-related code in train_ppo_ray.py:

ray start --node-ip-address=<IP of machine A> --resources '{"v100": 8}'

# Modify
ref_model = PPORayActorGroup(
        args.ref_num_nodes,
        args.ref_num_gpus_per_node,
        ReferenceModelRayActor,
        pg=pg,
        num_gpus_per_actor=0.25 if pg else 1,
    )

# To
ref_model = PPORayActorGroup(
        args.ref_num_nodes,
        args.ref_num_gpus_per_node,
        ReferenceModelRayActor,
        pg=pg,
        num_gpus_per_actor=1,
        resources={"v100": 1},
        num_resources_per_node=8,
    )

zaemyung commented 5 days ago

@hijkzzz I see. Thanks for the pointer! 🙏🏻 I can see that essentially the reward model (nn.Module) will be deployed and later called for its forward method.

I think this is a different question, but my reward model is essentially a prompt-based one, wrapping an existing LLM (without the value head) with a custom class for prompting and constrained generation (more specifically I'm using Qwen loaded up using Ollama). In this case, what would be an appropriate way of interaction?

hijkzzz commented 5 days ago

@hijkzzz I see. Thanks for the pointer! 🙏🏻 I can see that essentially the reward model (nn.Module) will be deployed and later called for its forward method.

I think this is a different question, but my reward model is essentially a prompt-based one, wrapping an existing LLM (without the value head) with a custom class for prompting and constrained generation (more specifically I'm using Qwen loaded up using Ollama). In this case, what would be an appropriate way of interaction?

You can load it with vLLM, but you need to modify some code; refer to the vLLM engine: https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/vllm_engine.py.
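
As a rough sketch of what a prompt-based reward model could look like when served with vLLM (the judge model, scoring prompt, and score parsing are all illustrative assumptions, not an OpenRLHF API):

import re
from vllm import LLM, SamplingParams

# Hypothetical judge model; swap in your own Qwen checkpoint.
judge = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=8)

def prompt_based_reward(query: str, response: str) -> float:
    # Ask the judge to emit a numeric score, then parse it out of the text.
    prompt = (
        "Rate the following response to the query on a scale from 0 to 10.\n"
        f"Query: {query}\nResponse: {response}\nScore:"
    )
    out = judge.generate([prompt], params)[0].outputs[0].text
    match = re.search(r"\d+(\.\d+)?", out)
    return float(match.group()) if match else 0.0

Wrapped behind the remote RM API discussed in this PR, such a function could replace a value-head reward model without any changes on the training side.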

@catqaq The remote reward model makes sense in this scenario; I recommend implementing this feature (remote RM) first and including it in Ray PPO.