OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Documentation for using Kuberay #266

Open karthik-nexusflow opened 2 months ago

karthik-nexusflow commented 2 months ago

Hi team, it would be great if the KubeRay commands for running OpenRLHF were added to the docs, to make a cold-start setup easier.

karthik-nexusflow commented 2 months ago

You could also just dump the commands you use, or I can help with the docs from a user's perspective once I get it set up.

wuxibin89 commented 2 months ago

@karthik-nexusflow Setting up a Ray cluster and submitting an OpenRLHF job to it are two separate stages.

  1. To set up a multi-node Ray cluster, there are plenty of options, depending on your infrastructure.
    • If you already run ML workflows on Kubernetes, then KubeRay is the best option for launching a Ray cluster.
    • If you only have a few nodes (e.g. 3~5), then manually starting the Ray head and worker nodes is the simplest way.
      
      # start head node first
      ray start --head --port=6379 --node-ip-address=10.0.0.1

      # start worker node 1
      ray start --node-ip-address=10.0.0.2 --address=10.0.0.1:6379

      # start worker node 2
      ray start --node-ip-address=10.0.0.3 --address=10.0.0.1:6379

  2. After your Ray cluster is set up, submit the OpenRLHF job to the cluster's dashboard address, like below:

```bash
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    --no-wait \
    -- python3 examples/train_ppo_ray.py \
    ...
```

Stage 2 is independent of how you launch the Ray cluster, and you can submit multiple jobs to the same cluster.
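
For more than a couple of nodes, the manual `ray start` commands from stage 1 can be generated instead of typed by hand. Below is a minimal dry-run sketch, not part of OpenRLHF: the names `HEAD_IP`, `WORKER_IPS`, `PORT`, and `print_ray_start_commands` are hypothetical, and the IPs and port are just the example values from this thread. It only prints the commands; it does not start Ray.

```bash
#!/usr/bin/env bash
# Dry run: prints the `ray start` command to run on each node; it does not start Ray.
# HEAD_IP, WORKER_IPS, and PORT are illustrative values (assumptions), not OpenRLHF settings.
HEAD_IP="10.0.0.1"
WORKER_IPS=("10.0.0.2" "10.0.0.3")
PORT=6379

print_ray_start_commands() {
  # Head node: workers connect to HEAD_IP:PORT.
  echo "ray start --head --port=${PORT} --node-ip-address=${HEAD_IP}"
  # One command per worker node, each pointing at the head node's address.
  local ip
  for ip in "${WORKER_IPS[@]}"; do
    echo "ray start --node-ip-address=${ip} --address=${HEAD_IP}:${PORT}"
  done
}

print_ray_start_commands
```

Each printed line is meant to be run on the node whose IP it names. Running `ray status` on the head node afterwards should show whether all nodes have joined.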

karthik-nexusflow commented 2 months ago

Thank you.

For 1 (KubeRay): it would be great if you could share the Dockerfile you are using.

For 2: setting up passwordless SSH has some issues on our cluster. Is it strictly necessary for that method? When you tried it, how did you go about it?

hijkzzz commented 2 months ago

We provide a vLLM-based Dockerfile at https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile which you can use as a starting point and modify to suit your setup.