[Feature] Multi-Node Batch Support

leedahl commented 1 year ago

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

Looking at AWS Batch and the support Batch has for Multi-Nodes, I was wondering if perhaps there may be a way to incorporate into the Ray Cluster Configuration support to use the Mulit-Node Batch jobs. What I find appealing with this is that if I were to terminate the head node Batch Job, all the worker nodes would terminate as well. The other appealing aspect is that AWS Batch has a Step Function integration. It would be nice to invoke a Batch Job that would spin up a mult-node cluster for Ray to use. With the limited tests I have done, I liked how if you abort a step function execution, the Multi-Node Batch job is terminated and all the nodes (head and worker) terminate.

I am happy to work on the code and submit a Pull Request, But I would need some help from the Multi-Node Batch team at AWS as there are a few questions about the Multi-Nodes Batch service that i am not real familar with yet.

Use case

Our basic use case is in imagery processing. We use Ray in a producer/consumer model to spin up consumers across a cluster of compute instances to process large blocks of imagery in parallel. The current issue is clean up when things go wrong. There are two ways things go wrong currently (the ray cluster has worker nodes that die and leave the cluster in a bad state, and something happens with logging in ray and the jobs never end even though the processing is complete). When these things happen we have defensive services that try to cleanup. These defensive services need to do a lot because the Ray cluster isn't a managed service. Running in Batch, which is a managed service makes clean up easier. We can terminate a batch job and the step function retry logic will spin up a new batch job and retry. In a Multi-Node scenario, it would spin up a new cluster and retry.

Related issues

No response

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

rsmith013 commented 1 year ago

I would also be interested in this.

I thought this would be straightforward, spin up the nodes with batch and collection the IPs to build a cluster of resources, but it looks like Ray is designed differently and the pattern would be to start a cluster then submit a job to that cluster and terminate the cluster when that job is done.

The patterns for using kubernetes and VMs make more sense, but I am still interested if this is possible.

leedahl commented 1 year ago

So, I have done a little more experimenting with ray and multi-node batch and step functions. I am using python apps for working with the ray cluster. I have created a cluster in a step function. On the head node I call a process that invokes the ray shell command to start a head node (ray start --head). On the workers I do something similar by starting ray workers that attach to the head. The python code that starts the head node also monitors the number of nodes through the Ray nodes interface. It also starts up queues and consumers and does the processing that I need. I have the consumers grow as the number of nodes come on line. This works okay, however the one draw back is that the batch / step integration doesn't support the nodeOverrides of the multi-node batch definition. Thus, I can't override the number of nodes to start. I get around this by specifying separate job definitions that differ only by the number of nodes. Then I use a choice statement in the state machine that determines which batch job to start. This is not ideal but it works.

I still think that there could be a case made for specify a cluster size in the Ray Cluster config and having ray initiate a multi-node batch job. Then an application could connect to the cluster and submit jobs.

amzn / amazon-ray