Open karthik-nexusflow opened 2 months ago
You could also just dump the commands you use. I can help with the docs from a user perspective once I get it set up.
@karthik-nexusflow Setting up a Ray cluster and submitting an OpenRLHF job to that cluster are two separate stages.
1. Set up the Ray cluster, for example by starting Ray manually on each node:
```bash
# start head node first
ray start --head --port=6379 --node-ip-address=10.0.0.1
# then start each worker node and point it at the head node
ray start --node-ip-address=10.0.0.2 --address=10.0.0.1:6379
ray start --node-ip-address=10.0.0.3 --address=10.0.0.1:6379
```
2. Once the Ray cluster is set up, submit the OpenRLHF job to the cluster through its dashboard address, like below:
```bash
ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{"working_dir": "/openrlhf", "pip": "/openrlhf/requirements.txt"}' \
    --no-wait \
    -- python3 examples/train_ppo_ray.py \
    ...
```
Stage 2 is independent of how you launch the Ray cluster, and you can submit multiple jobs to the same cluster.
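For stage 1, here is a rough sketch of a helper that prints the `ray start` command for each node, so the same list can be pasted into the docs or run node by node. It only echoes commands and does not call Ray itself. `HEAD_IP`, `WORKER_IPS`, `PORT`, and `cluster_commands` are hypothetical names chosen for illustration, and the defaults are just the example addresses from above.

```bash
#!/usr/bin/env bash
# Sketch only: print the stage-1 commands instead of running them.
# HEAD_IP, WORKER_IPS and PORT are hypothetical variables, not OpenRLHF or Ray settings.
set -eu

HEAD_IP="${HEAD_IP:-10.0.0.1}"
WORKER_IPS="${WORKER_IPS:-10.0.0.2 10.0.0.3}"
PORT="${PORT:-6379}"

# Emit one `ray start` command per node: head first, then each worker.
cluster_commands() {
  echo "ray start --head --port=${PORT} --node-ip-address=${HEAD_IP}"
  for ip in ${WORKER_IPS}; do
    echo "ray start --node-ip-address=${ip} --address=${HEAD_IP}:${PORT}"
  done
}

cluster_commands
```

Each printed command has to be run on the node whose IP it names. Once the workers have joined, `ray status` on the head node should list them, and after stage 2 the job can be followed with `ray job status` or `ray job logs` using the job ID that `ray job submit` prints.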
Thank you.
For 1 (KubeRay), it would be great if you could share the Dockerfile you are using.
For 2, setting up passwordless SSH has some issues on our cluster. Is it strictly necessary? When you tried that method, how did you go about it?
We have provided a vLLM-based Dockerfile at https://github.com/OpenLLMAI/OpenRLHF/tree/main/dockerfile. You could use it as a starting point and modify it for your setup.
Hi team, it would be great if the KubeRay commands for running OpenRLHF were added to the docs, to make a cold-start setup easier.