cubed-dev / cubed

Bounded-memory serverless distributed N-dimensional array processing
https://cubed-dev.github.io/cubed/
Apache License 2.0

Ray Executor #488

Open alxmrs opened 6 days ago

alxmrs commented 6 days ago

In addition to accelerator support (e.g. via #304), Cubed could benefit ML users by providing a Ray executor: https://docs.ray.io/en/latest/ray-core/walkthrough.html

Since Cubed follows a serverless model, I bet it could get away with only using Ray Tasks (remote functions).
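
To make the "Tasks only" idea concrete, here is a minimal sketch of fanning stateless work out as Ray tasks. It is not Cubed's executor interface: `run_chunk_tasks`, `double`, and the `use_ray` flag are hypothetical names for illustration, and the only Ray calls used are `ray.init`, `ray.remote`, and `ray.get`. A real executor would also need retries, concurrency limits, and callbacks, which are left out here.

```python
import importlib.util


def run_chunk_tasks(func, chunks, use_ray=True):
    """Apply `func` to every item in `chunks` as independent, stateless tasks.

    Hypothetical helper for illustration; not part of Cubed or Ray.
    """
    chunks = list(chunks)
    ray_available = importlib.util.find_spec("ray") is not None
    if not (use_ray and ray_available):
        # Local fallback so the sketch still runs without Ray installed.
        return [func(chunk) for chunk in chunks]

    import ray

    # Start (or reuse) a local Ray instance; a real executor would
    # connect to an existing cluster instead.
    ray.init(ignore_reinit_error=True)

    # Wrap the plain function as a Ray remote function (a Task).
    remote_func = ray.remote(func)

    # Submit one task per chunk; each call returns an ObjectRef immediately.
    refs = [remote_func.remote(chunk) for chunk in chunks]

    # Block until all tasks finish; results come back in submission order.
    return ray.get(refs)


def double(x):
    # Stand-in for a per-chunk operation.
    return x * 2


if __name__ == "__main__":
    print(run_chunk_tasks(double, range(4)))
```

No actors or shared state are involved: each task only sees its own input, which is the property that should make a serverless-style model map onto Ray tasks.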

From talking with @cromwellian a bit, my hope is that Cubed could provide memory bounds when trying to saturate GPUs during model training. I'm not sure exactly what a training loop with Cubed would look like. Here's how Ray integrates with PyTorch, for example: https://docs.ray.io/en/latest/train/api/doc/ray.train.torch.TorchTrainer.html#ray.train.torch.TorchTrainer

@shoyer once pointed out to me that GPU OOM errors occur while taking the gradient of a function graph, not necessarily on the forward pass. I'm not sure yet whether Cubed is in fact a good fit for tackling this problem, only that the potential is exciting.

tomwhite commented 6 days ago

Thanks for opening this issue @alxmrs! I think Ray would be a great runtime for Cubed, and it should be relatively straightforward to write an executor for (maybe a bit like the Modal one?). Do you know what people generally run Ray on in production/at scale?

alxmrs commented 6 days ago

Hey Tom! Do you mean what does the userbase look like, or do I know specific people? On the former: Ray is the engine that OpenAI uses to train its GPT models; it's really popular in the ML world. On the latter: Ray, the person (cromwellian), uses Ray, the framework, at Roblox for model training. :)

> should be relatively straightforward to write an executor for (maybe a bit like the Modal one?).

I agree, and it does look like it will be similar to Modal.

tomwhite commented 6 days ago

I meant usage of Anyscale vs KubeRay vs ?? I was wondering if there was a choice that most people use, or whether it's a bit of everything.

> Ray, the person (cromwellian), uses Ray, the framework, at Roblox for model training. :)

Got it!