OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

Multi-node training. Slurm vs Slurm + Ray #337

Closed · yannikkellerde closed this 2 days ago

yannikkellerde commented 3 days ago

According to the README, multi-node training is possible either with plain Slurm or with Slurm + Ray. Under my specific circumstances on my HPC cluster, I have so far only managed to get the Slurm version without Ray working (using conda).

Now I am wondering: what am I losing by not using Ray? Are there significant performance differences?

hijkzzz commented 3 days ago

With Ray + vLLM you can train larger models, and training is faster.
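
For reference, the usual Slurm + Ray pattern is to bring up a Ray cluster inside a single Slurm allocation and then submit the training job to it. The sbatch script below is only a minimal sketch, assuming 2 nodes and the default Ray ports; the resource directives, the training entrypoint (`openrlhf.cli.train_ppo_ray`), and its arguments are placeholders you should replace with the exact command from the README for your OpenRLHF version.

```bash
#!/bin/bash
#SBATCH --job-name=openrlhf-ray
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --exclusive

# Sketch: start a Ray head on the first node, join the remaining nodes
# as Ray workers, then submit the OpenRLHF job to the Ray cluster.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379

# Start the Ray head on the first node (runs in the background of this script).
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_node_ip" --port=$port --block &
sleep 10

# Start Ray workers on the remaining nodes, all joining the head.
for ((i = 1; i < ${#nodes[@]}; i++)); do
    srun --nodes=1 --ntasks=1 -w "${nodes[$i]}" \
        ray start --address="$head_node_ip:$port" --block &
done
sleep 10

# Submit the training job via the Ray job API (dashboard default port 8265).
# The entrypoint and flags below are illustrative only; use the README command.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray job submit --address="http://$head_node_ip:8265" \
    -- python -m openrlhf.cli.train_ppo_ray --actor_num_nodes 1 --vllm_num_engines 2
```

The gain over plain Slurm is that Ray lets the actor, critic, reference, and reward models be placed on different nodes/GPU groups, and vLLM engines can be colocated in the same cluster to speed up generation during PPO.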