According to the README, multi-node training is possible either with Slurm alone or with Slurm + Ray. On my HPC cluster, I have so far only managed to get the Slurm-only version (without Ray) working, using conda.
Now I am wondering what I am losing by not using Ray. Are there significant performance differences?