dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
https://dstack.ai
Mozilla Public License 2.0

Support multi-device TPU Pods #1337

Open r4victor opened 3 weeks ago

r4victor commented 3 weeks ago

#1323 added support for single-device TPU Pods. Multi-device TPU Pods are not yet supported because running multi-node tasks on them may require changes to dstack.

Currently, dstack runs the different jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is a suboptimal Pod management UX: users won't see TPU Pods in pools; instead, each TPU Pod device will appear as a separate instance. This can be mitigated by introducing a cluster concept to dstack.

Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.

r4victor commented 4 days ago

The most promising solution at the moment seems to be the instance-per-TPU-device model. Provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning TPU v2-32 would create four TPU v2 instances. `dstack pool ps` shows each TPU device as a separate instance, but instances can be grouped, e.g. by TPU Pod name. Users delete a TPU Pod by specifying its name (or the name of any instance in the Pod).
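The grouping idea can be sketched roughly as follows. This is a hypothetical illustration, not dstack's actual models: the `Instance` type, the `pod_name` field, and the `<pod>-<index>` naming scheme are all assumptions.

```python
# Sketch: grouping per-device instances by TPU Pod name for a
# `dstack pool ps`-style listing. All names and types here are
# hypothetical, not part of dstack's real codebase.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Instance:
    name: str      # one instance per TPU device, e.g. "v2-32-pod-0"
    pod_name: str  # shared TPU Pod name, e.g. "v2-32-pod"


def group_by_pod(instances):
    """Group per-device instances under their TPU Pod name."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst.pod_name].append(inst)
    return dict(groups)


# A v2-32 Pod provisioned as four per-device instances:
instances = [Instance(f"v2-32-pod-{i}", "v2-32-pod") for i in range(4)]
grouped = group_by_pod(instances)
print(sorted(grouped), len(grouped["v2-32-pod"]))
```

Deleting by Pod name then amounts to resolving either a Pod name or any member instance name to one group and terminating every instance in it.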

To support multi-node TPU tasks, dstack can determine the number of nodes automatically from the number of devices in the TPU Pod. For example, if a user specifies tpu-v2-32, dstack will run four jobs. An alternative would be to ask only for a TPU type like tpu-v2 and determine the number of cores from the number of jobs. The downside of the latter is that users won't be able to specify an arbitrary number in `nodes`, so they'd have to calculate it from the TPU configuration they want to run.
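The first option (deriving the job count from the TPU name) can be sketched as below. This is a minimal, hypothetical helper, assuming TPU v2/v3 Pods with 8 TensorCores per host; other TPU generations count cores per host differently, and the function name is not part of dstack.

```python
# Hedged sketch: derive the number of multi-node jobs from a TPU
# accelerator name like "tpu-v2-32". Assumes v2/v3 Pods, where each
# host exposes 8 TensorCores (4 chips x 2 cores).
CORES_PER_HOST = {"v2": 8, "v3": 8}


def num_jobs(tpu_name: str) -> int:
    # "tpu-v2-32" -> version "v2", 32 cores in total
    _, version, cores = tpu_name.split("-")
    per_host = CORES_PER_HOST[version]
    total = int(cores)
    if total % per_host:
        raise ValueError(f"{tpu_name}: not a multiple of {per_host} cores")
    return total // per_host


print(num_jobs("tpu-v2-32"))  # -> 4, one job per TPU v2 host
```

The inverse mapping (jobs to cores) is the same arithmetic in reverse, which is why the alternative design forces users to do this calculation themselves.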

Implementation details: